A process for retrieving information in large
heterogeneous data bases, wherein information
retrieval through visual querying/browsing is
supported by dynamic taxonomies; the process
comprises the steps of: initially showing (F1) a
complete taxonomy for the retrieval; refining (F2)
the retrieval through a selection of subsets of
interest, where the refining is performed by
selecting concepts in the taxonomy and combining
them through Boolean operations; showing (F3) a
reduced taxonomy for the selected set; and further
refining (F4) the retrieval through an iterative
execution of the refining and showing steps.

DYNAMIC TAXONOMY PROCESS FOR BROWSING AND
RETRIEVING INFORMATION IN LARGE HETEROGENEOUS DATA
BASES



The present invention refers to a dynamic
taxonomy process for browsing and retrieving
information in large heterogeneous data bases.

Information retrieval on this type of data
bases (for example those available on the Internet)
is nowadays a slow task, sometimes impossible to
realize due to the enormous amount of data to be
analyzed, and one that can be implemented only with
difficulty with the currently available tools. The
present Applicants developed for such purpose a
process solving the above problems by an innovative
use of taxonomies as a structuring and information
access tool.



Dynamic taxonomies are a model to conceptually
describe and access large heterogeneous information
bases composed of texts, data, images and other
multimedia documents.

A dynamic taxonomy is basically an IS-A
hierarchy of concepts, going from the most general
(topmost) to the most specific. A concept may have
several fathers. This is a conceptual schema of the
information base, i.e. the "intension". Documents
can be freely classified under different concepts
at different levels of abstraction (this is the
"extension"). A specific document is generally
classified under several concepts.

Dynamic taxonomies enforce the IS-A
relationship by containment, i.e. the documents
classified under a concept C are the deep extension
of C, i.e. the recursive union of all the documents
classified under C and under each descendant C' of
C.



In a dynamic taxonomy, concepts can be
composed through classical boolean operations. In
addition, any set S of documents in the universe of
discourse U (defined as the set of all documents
classified in the taxonomy) can be represented by a
reduced taxonomy. S may be synthesized either by
boolean expressions on concepts or by any other
retrieval method (e.g. "information retrieval").
The reduced taxonomy is derived from the original
taxonomy by pruning the concepts under
which no document d in S is classified.

A new visual query/browsing approach is
supported by dynamic taxonomies. The user is
initially presented with the complete taxonomy.
He/she can then refine the result by selecting a
subset of interest. Refinement is done by selecting
concepts in the taxonomy and combining them through
boolean operations. He/she will then be presented
with a reduced taxonomy for the selected set of
documents, which can be iteratively further
refined.

The invention described here covers the
following aspects of dynamic taxonomies:

1. additional operations;

2. abstract storage structures and operations on
such structures for the intension and the
extension;

3. physical storage structures, architecture and
implementation of operations;

4. definition, use and implementation of virtual
concepts;

5. definition, use and implementation of
time-varying concepts;

6. binding a dynamic taxonomy to a database system;

7. using dynamic taxonomies to represent user
profiles of interest and implementation of user
alert for new interesting documents based on
such profiles of interest.

The above and other objects and advantages of
the invention, as will appear from the following
description, are obtained by a dynamic taxonomy
process as claimed in Claim 1. Preferred
embodiments and non-trivial variations of the
present invention are claimed in the dependent
Claims.

The present invention will be better described
by some preferred embodiments thereof, given as a
non-limiting example, with reference to the
enclosed drawing, whose Fig. 1 shows a block
diagram of the process of the present invention.

Before proceeding with a detailed description
of the invention, suitable terminology remarks will
be made. The set of documents classified under the
taxonomy (corpus) is denoted by U, the universe of
discourse. Each document d in U is uniquely
identified by an abstract label called document ID
of d (DID(d)). Each concept c in the taxonomy is
uniquely identified by an abstract label called
concept ID of c (CID(c)). Concepts are partitioned
into terminal concepts (concepts with no concept
son in the taxonomy) and non-terminal concepts. T
denotes the set of concepts used in the taxonomy.

The taxonomy is usually a tree, but lattices
(deriving from a concept having more than one
father) are allowed. Documents can be classified
under any (terminal or non-terminal) concept in the
taxonomy. A specific document d in U may be
classified under one or more concepts. The single,
most general concept in the taxonomy is called the
root of the taxonomy. This concept usually need not
be stored in the extension, since it
represents the entire corpus.

The term "deep extension" of a concept c
denotes all the documents classified under c or
under any descendant of c. The term "shallow
extension" of a concept c denotes all the documents
directly classified under c.

If c is a concept, C^up(c) denotes the set
{c} ∪ {c': c' is an ancestor of c in the taxonomy,
and c' is not the root of the taxonomy}. C^up(c) is
computed by the recursive application of operation
AI03 (described hereinbelow). If c is a concept,
C^down(c) denotes the set {c} ∪ {c': c' is a
descendant of c in the taxonomy}. C^down(c) is
computed by the recursive application of operation
AI02 (described hereinbelow).
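
As an illustration only, the two sets can be sketched in Python. The dictionaries `sons` and `fathers` are hypothetical in-memory stand-ins for the father-to-son and son-to-father relations read by AI02 and AI03, and the concept names are invented for the example.

```python
def c_down(c, sons):
    """Return {c} plus every descendant of c (recursive application of AI02)."""
    result = {c}
    for s in sons.get(c, ()):
        result |= c_down(s, sons)
    return result


def c_up(c, fathers, root):
    """Return {c} plus every ancestor of c except the root (recursive AI03)."""
    result = {c}
    for f in fathers.get(c, ()):
        if f != root:
            result |= c_up(f, fathers, root)
    return result


# Hypothetical lattice: A2 has two fathers, A and B.
sons = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["A2"]}
fathers = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A", "B"]}

print(sorted(c_down("A", sons)))
print(sorted(c_up("A2", fathers, "root")))
```

Returning sets rather than lists keeps the result well defined when a concept is reachable through more than one father.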

With reference to Fig. 1, a block diagram is
shown of the main steps of the process of the
present invention, from which all further
developments of the process itself originate, such
developments being described hereinbelow.

According to the diagram in Fig. 1, the
process for retrieving information on large
heterogeneous data bases of the present invention
comprises the steps of:

(F1) initially showing a complete taxonomy for
retrieval;

(F2) refining the retrieval through a
selection of subsets of interest, where the
refining step is performed by selecting concepts in
the taxonomy and combining them through boolean
operations;

(F3) showing a reduced taxonomy for the
selected set; and

(F4) further refining the retrieval through an
iterative execution of the refining and showing
steps.



In addition to the previously-described
operations, the following operations can be
supported:

a. projection under a given CID of a set S of DIDs:
it extracts all the children c of CID such that
there is at least one document in S in the deep
extension of c;

b. extracting the CIDs for a specific document d
in U.



The prior art has never specified storage
structures nor the implementation of operations,
which are both presented in this context. Abstract
storage structures are defined with the following
notation. Given domains A1, ..., AN and B1, ..., BM:

• the relation R: [A1, ..., AN] -> [B1, ..., BM]
means that an N-uple of values drawn from domains
A1, ..., AN uniquely identifies an M-uple of
values drawn from domains B1, ..., BM. If
[A1, ..., AN] -> [B1, ..., BM] holds, then any
[A1, ..., AN] -> [Bi] holds, where Bi is drawn
from any domain in the set {B1, ..., BM};

• the relation R: [A1, ..., AN] -> {B1, ..., BM}
means that an N-uple of values drawn from domains
A1, ..., AN uniquely identifies a set of M-uples
of values drawn from domains B1, ..., BM. If
[A1, ..., AN] -> {B1, ..., BM} holds, then any
[A1, ..., AN] -> {Bi} holds, where Bi is drawn
from any domain in the set {B1, ..., BM}.

When brackets are omitted in the right part,
square brackets are assumed.

Abstract relations can be trivially mapped (for
the purpose of illustration, and with no intent to
restrict their representation) to relations in a
relational schema, in the following way:

R: [A1, ..., AN] -> [B1, ..., BM] maps into
R(A1, ..., AN, B1, ..., BM)

R: [A1, ..., AN] -> {B1, ..., BM} maps into a set
of 4NF relations Ri(A1, ..., AN, Bi)

where underlined domains are key attributes of R
(A1, ..., AN in the first case; all the attributes
of each Ri in the second, since one N-uple may
identify several values of Bi).
Abstract SQL queries on these relations will be
used to express operations. When expedient, the
notation A.B applied to an abstract relation
[A] -> [B] or [A] -> {B} will be used to denote the
value or the set of values of B corresponding to a
given value of A. Domain CID holds the abstract
labels of concepts, i.e. stands for the set of
values {CID(c), for all c in the taxonomy}. Domain
DID holds the abstract labels of documents, i.e.
denotes the set of values {DID(d), for all d in U}.



WO 00/36529    PCT/IT99/00401



Abstract structures to store the intension
will now be described.

The intension is the taxonomy itself; it can
be seen as a conceptual schema for a set of
corpora. The intension is stored as:

AIS1. One or more "dictionary" relations in the
form

Di: [CID] -> [textualLabel]

storing the user-visible definition of each
concept; the domain "textualLabel" holds natural
language descriptions of concepts. Each dictionary
can be in a different "language", thereby allowing
multilingual corpora and/or different descriptions
of concepts.

AIS2. A language directory, identifying the
appropriate dictionary relation for a specific
"language" (required only if more than one
"language" for concept description is used) in the
form:

LD: [LANGUAGE_ID] -> D

where LANGUAGE_ID holds the abstract identification
of languages and D holds the existing dictionaries.

An alternate representation of AIS1, AIS2 is
by a single relation

AIS1': [CID, LANGUAGE_ID] -> textualLabel.




AIS3. A father to son relation in the form

FS: [CID] -> {SON_CID}

or

FS': [CID, SEQ] -> [SON_CID]

storing, for each concept c, its sons in the
taxonomy. The domain SON_CID is the same as CID.
The domain of SEQ is the set of natural numbers.
The second form, which is generally used, allows a
meaningful display order to be supplied among
the sons of a concept c.

AIS4. A son to father relation, in the form

SF: [CID] -> {FATHER_CID}

storing, for each concept c, its fathers in the
taxonomy. The domain FATHER_CID is the same as CID.
If the taxonomy is not a lattice (i.e. any concept
c can have no more than one father), this relation
becomes:

SF: [CID] -> [FATHER_CID].

In this latter case, information on the father
of a specific concept c may alternatively be stored
in the dictionaries as:

Di: [CID] -> FATHER_CID, textualLabel

although this results in redundancy if more than
one dictionary is maintained.






Abstract storage structures for the extension
will now be described.

The extension represents the classification of
documents. As such, it depends on the specific
corpus. The extension is abstractly represented by
the following relations:

AES1. Deep extension, in the form

DE: [CID] -> {DID}

storing, for each concept c, all the documents in
its deep extension (that is, all the documents
classified under c or under any descendant c' of
c).

AES2. Shallow extension, in the form

SE: [CID] -> {DID}, equivalent to [CID, DID]

storing, for each concept c, all the documents in
its shallow extension (that is, all the documents
directly classified under c). The shallow extension
and the deep extension are the same for terminal
concepts, so that for such terminal concepts only
one of DE and SE needs to be kept (typically, DE
will be kept).

AES3. Classification, in the form

CL: [DID] -> {CID}

storing, for each document, the most specific
concepts under which it is classified. All the
ancestors of these concepts can be easily recovered
through the son-to-father (SF) relation in the
intension. This structure is required only if the
display of the classification for stored documents
is supported at the user level. This storage
structure is optional, since the set K of concepts
under which a specific DID is stored can be
synthesized by operation AE05 applied to each
concept c in T on the singleton set {DID}. A
concept c is then in K if and only if operation
AE05 returns TRUE.

AES4. Document directory

Not specified, since it depends on the host system.
It maps a document id into information required to
retrieve the specific document (for example, the
file name).

The abstract implementation of operations on
the intension will now be described.

AI01. Given a concept c identified by K=CID(c),
find its label in a specific language L.
1. Access the appropriate language directory
SELECT D
FROM LD
WHERE LANGUAGE_ID=L
2. Use K as a key to access the textual label
SELECT textualLabel
FROM D
WHERE CID=K

AI02. Given K=CID(c), find all its sons.
Access the father-to-son relation FS, using K as a
partial key
SELECT SON_CID
FROM FS
WHERE CID=K
Or
Access the father-to-son relation FS', using K as a
partial key
SELECT SEQ, SON_CID
FROM FS'
WHERE CID=K
ORDER BY SEQ, SON_CID

AI03. Given K=CID(c), find all its fathers.
Access the son-to-father relation SF, using K as a
partial key
SELECT FATHER_CID
FROM SF
WHERE CID=K
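
Operations AI01 to AI03 can be tried out with a minimal sketch based on Python's standard `sqlite3` module. The table layout is one possible mapping of the abstract relations, not a prescribed one: the table names `D_EN` and `FS` (standing for the second form FS'), the sample concepts and the helper names `label`, `sons` and `fathers` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LD (LANGUAGE_ID TEXT PRIMARY KEY, D TEXT);
    CREATE TABLE D_EN (CID INTEGER PRIMARY KEY, textualLabel TEXT);
    CREATE TABLE FS (CID INTEGER, SEQ INTEGER, SON_CID INTEGER,
                     PRIMARY KEY (CID, SEQ));
    CREATE TABLE SF (CID INTEGER, FATHER_CID INTEGER,
                     PRIMARY KEY (CID, FATHER_CID));
    INSERT INTO LD VALUES ('en', 'D_EN');
    INSERT INTO D_EN VALUES (1, 'Art'), (2, 'Painting'), (3, 'Sculpture');
    INSERT INTO FS VALUES (1, 1, 3), (1, 2, 2);
    INSERT INTO SF VALUES (2, 1), (3, 1);
""")


def label(k, lang):
    """AI01: look up the dictionary for lang, then the label of concept k."""
    (table,) = conn.execute(
        "SELECT D FROM LD WHERE LANGUAGE_ID = ?", (lang,)).fetchone()
    # Table names cannot be bound as parameters; 'table' comes from our own LD.
    row = conn.execute(
        f'SELECT textualLabel FROM "{table}" WHERE CID = ?', (k,)).fetchone()
    return row[0] if row else None


def sons(k):
    """AI02 (second form): sons of k in display order."""
    rows = conn.execute(
        "SELECT SON_CID FROM FS WHERE CID = ? ORDER BY SEQ, SON_CID", (k,))
    return [r[0] for r in rows]


def fathers(k):
    """AI03: fathers of k."""
    rows = conn.execute("SELECT FATHER_CID FROM SF WHERE CID = ?", (k,))
    return [r[0] for r in rows]
```

With the sample rows, SEQ places Sculpture before Painting among the sons of concept 1, which is the display-order role the text assigns to FS'.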

AI04. Insert, delete, change operations.

Insert operations are performed by inserting the
new concept C:

• in the dictionaries (AIS1)

• in the father to son relation (AIS3)

• in the son to father relation (AIS4)

If C is a son of another concept C', it may be
useful to allow the user to reclassify under C some
of the documents presently classified in the
shallow extension of C'.

In the case in which each concept has a single
father in the taxonomy, the deletion of a concept C
is performed by deleting from the intension (AIS1,
AIS3, AIS4) all concepts c ∈ C^down(C). In addition,
in order to avoid losing documents, documents
in the deep extension of C should be added to the
shallow extension of C', where C' is the father of
C in the taxonomy, unless C' is the root of the
taxonomy. The shallow (AES2) and deep (AES1)
extensions for all concepts c ∈ C^down(C) must be
removed. The concepts in C^down(C) must be removed
from the classification (AES3) of all the documents
in the deep extension of C.

Alternatively, and in the general case in which
concepts can have multiple fathers, we proceed as
follows.

Define LinkDelete(f, s) as:
1. remove from AIS3 the instance where CID=CID(f)
   and SON_CID=CID(s)
2. remove from AIS4 the instance where CID=CID(s)
   and FATHER_CID=CID(f)

Define BasicDelete(c) as:
1. for each f in {f: f is a father of c} call
   LinkDelete(f, c)
2. remove the deep (AES1) and shallow (AES2)
   extension for c, its classification (AES3), and
   any dictionary entries associated with c.

Define RecursiveDelete(f, s) as:
1. if f is the only father of s then
   1.1. for each s' in {s': s' is a son of s} call
        RecursiveDelete(s, s')
   1.2. call BasicDelete(s)
2. else call LinkDelete(f, s)

Define RecomputeDeepExtension(c) as:
1. for each s in {s: s is a son of c}
   1.1. set the deep extension of c:
        DeepExtension(c) = DeepExtension(c) union
        RecomputeDeepExtension(s)
2. return (DeepExtension(c))

Define UpdateDeepExtension(c) as:
1. for each f in {f: f is a father of c}
   1.1. DeepExtension(f) = DeepExtension(c) union
        ShallowExtension(f)
   1.2. UpdateDeepExtension(f)

Deletion of C is then implemented as:

1. Compute the set F(C), which represents all the
   fathers of the concept to be deleted (accessible
   through relation AIS4). All and only the
   concepts in F(C) and their ancestors will have
   their deep extension affected by the deletion of
   C.

2. For each s in {s: s is a son of C}, call
   RecursiveDelete(C, s)

3. Call BasicDelete(C).

4. Recompute the deep extension of all the fathers
   of C: for each f in F(C) call
   RecomputeDeepExtension(f)

5. Update the deep extension of all the ancestors
   of the set F(C):
   5.1. For each f in F(C) call
        UpdateDeepExtension(f)
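
The deletion procedure for the multiple-father case can be sketched as follows. This is an illustrative reading, not the definitive algorithm: the dictionary-of-dictionaries `tax` and all concept names are hypothetical, shallow extensions are assumed to be stored for every concept, and, as a deliberate simplification, deep extensions are rebuilt from the shallow extension plus the sons' deep extensions so that documents reachable only through the deleted concept drop out.

```python
def link_delete(tax, f, s):
    """Remove the f -> s link from AIS3 and AIS4."""
    tax["sons"][f].remove(s)
    tax["fathers"][s].remove(f)


def basic_delete(tax, c):
    """Unlink c from its fathers, then drop its extensions and label."""
    for f in list(tax["fathers"].get(c, [])):
        link_delete(tax, f, c)
    for relation in ("sons", "fathers", "deep", "shallow", "labels"):
        tax[relation].pop(c, None)
    for cids in tax["classification"].values():
        cids.discard(c)


def recursive_delete(tax, f, s):
    """Delete s and its subtree if f is its only father, else just unlink."""
    if tax["fathers"][s] == [f]:
        for s2 in list(tax["sons"].get(s, [])):
            recursive_delete(tax, s, s2)
        basic_delete(tax, s)
    else:
        link_delete(tax, f, s)


def recompute_deep(tax, c):
    """Rebuild the deep extension of c from the shallow extensions below it."""
    deep = set(tax["shallow"].get(c, set()))
    for s in tax["sons"].get(c, []):
        deep |= recompute_deep(tax, s)
    tax["deep"][c] = deep
    return deep


def refresh_up(tax, c):
    """Propagate the recomputed deep extension of c to all its ancestors."""
    for f in tax["fathers"].get(c, []):
        deep = set(tax["shallow"].get(f, set()))
        for s in tax["sons"].get(f, []):
            deep |= tax["deep"].get(s, set())
        tax["deep"][f] = deep
        refresh_up(tax, f)


def delete_concept(tax, c):
    affected = list(tax["fathers"].get(c, []))   # step 1: F(C)
    for s in list(tax["sons"].get(c, [])):       # step 2
        recursive_delete(tax, c, s)
    basic_delete(tax, c)                         # step 3
    for f in affected:                           # steps 4 and 5
        recompute_deep(tax, f)
        refresh_up(tax, f)


# Hypothetical lattice: X has two fathers (A and B), A1 has one (A).
tax = {
    "sons": {"root": ["A", "B"], "A": ["A1", "X"], "B": ["X"]},
    "fathers": {"A": ["root"], "B": ["root"], "A1": ["A"], "X": ["A", "B"]},
    "shallow": {"root": set(), "A": {1}, "B": {2}, "A1": {3}, "X": {4}},
    "deep": {"root": {1, 2, 3, 4}, "A": {1, 3, 4}, "B": {2, 4},
             "A1": {3}, "X": {4}},
    "labels": {c: c for c in ("root", "A", "B", "A1", "X")},
    "classification": {1: {"A"}, 2: {"B"}, 3: {"A1"}, 4: {"X"}},
}
delete_concept(tax, "A")
```

Deleting A removes A1 with it, while X survives because it still has B as a father.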

Changes in the taxonomy may be of three types:

1. changing the labeling of a concept C: this only
   requires the modification of the textualLabel in
   AIS1

2. changing the place of a concept C in the
   taxonomy

3. adding an additional father C' to C in the
   taxonomy

In case 2, let C' be the current father of C and
C'' the new father of C. First, C must be deleted
from the taxonomy, and reinserted with C'' as a
father. The deep extension of C must be deleted
from the deep extension of all concepts c ∈ C^up(C')
(by set subtraction, or by applying the above
algorithm for deletion with steps 2 and 3 replaced
by C reparenting). The deep extension of C must be
added to the deep extension of all concepts c ∈
C^up(C'') (by set union). No changes in shallow
extensions are required.

In case 3, the deep extension of C must be
added to the deep extension of all concepts c ∈
C^up(C') (by set union).

The abstract implementation of operations on
the extension will now be described.

AE01. Given a concept c such that CID(c) = K, find
its deep extension.
Access the deep-extension relation DE, using K as a
partial key
SELECT DID
FROM DE
WHERE CID=K

AE02. Given a concept c such that CID(c) = K, find
its shallow extension.
Access the shallow extension relation SE, using K
as a partial key
SELECT DID
FROM SE
WHERE CID=K



AE03. Test the membership of a set of DIDs {DID} in 
the deep extension of a concept CID. 

1. Retrieve the deep extension of CID 

2. For each d in {DID}, test whether d belongs to 
the deep-extension; if it does, return TRUE; if 
no d in {DID} does, return FALSE 

AE04. Given a set of DIDs {DID}, count the number
of documents in {DID} which are also in the deep
extension of CID.

1. Retrieve the deep extension of CID

2. Initialize CNT to 0

3. For each d in {DID}, test whether d belongs to
the deep-extension; if it does, CNT=CNT+1

4. Return CNT

AE05. Test the membership of a set of DIDs {DID} in
the shallow extension of a concept CID.
As in AE03, by substituting the deep extension with
the shallow extension.
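
A minimal Python sketch of AE03, AE04 and AE05 follows, assuming that an extension has already been retrieved as a Python set of DIDs; the function names and sample values are illustrative only.

```python
def in_extension(dids, extension):
    """AE03 / AE05: TRUE as soon as one DID is found in the extension."""
    return any(d in extension for d in dids)


def count_in_extension(dids, extension):
    """AE04: number of DIDs that also belong to the extension."""
    cnt = 0
    for d in dids:
        if d in extension:
            cnt += 1
    return cnt


deep_of_c = {1, 3, 4}      # hypothetical deep extension of some concept
print(in_extension({2, 3}, deep_of_c))
print(count_in_extension({1, 2, 3}, deep_of_c))
```

The same `in_extension` function serves AE03 and AE05: only the extension passed in (deep or shallow) changes. `any` stops at the first hit, which matches the early return described in AE03.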

AE06. Given a set of DIDs {DID}, produce the
projection under a concept CID.

1. Retrieve the set {SON} of all the sons of CID

2. Initialize set R to empty

3. For each concept s in {SON}, use operation AE03,
or operation AE04 if counters are desired, to
test the membership of {DID} in s. If the
operation returns TRUE (>0 if AE04 is used) add
s to set R

4. Return R
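
The projection can be sketched in Python as follows, assuming `sons` and `deep` are in-memory dictionaries standing in for AIS3 and AES1; these names, the `with_counts` flag and the sample data are assumptions of the example.

```python
def projection(dids, cid, sons, deep, with_counts=False):
    """AE06: sons of cid whose deep extension shares a document with dids."""
    result = []
    for s in sons.get(cid, []):
        n = len(dids & deep.get(s, set()))   # AE04 as a set intersection
        if n > 0:
            result.append((s, n) if with_counts else s)
    return result


sons = {"root": ["A", "B"], "A": ["A1", "A2"]}
deep = {"A": {1, 2}, "B": {3}, "A1": {1}, "A2": {2}}

print(projection({1, 3}, "root", sons, deep))
print(projection({1, 2}, "root", sons, deep, with_counts=True))
```

A list is returned instead of a set so that the display order of the sons is preserved.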

AE07. Given a set of DIDs {DID}, produce the
reduced taxonomy for {DID}.

As a clarification, the set of DIDs for which
the reduced taxonomy has to be produced can be
generated by operations on the taxonomy and also by
any other means, including, without loss of
generality, database queries and information
retrieval queries. Also, the current combination of
concepts can be used as a pre-filter for other
retrieval methods.

For performance reasons, the reduced taxonomy
is usually produced on demand: the request only
displays the highest levels in the tree. The set
{DID} is kept in memory, so that when the explosion
of a specific concept in the reduced taxonomy is
requested, appropriate filtering is performed.

1. Produce the projection of {DID} for the root

On the subsequent explosion of concept c:
Produce the projection of {DID} for c

The reduced tree can also be totally computed
in a single step. Let RT be the set of concepts in
the reduced tree. RT can be computed by testing,
for each concept c in T, the membership of {DID}
in c through operation AE03 or AE04 (if counters
are required). Concept c is in RT if and only if
operation AE03 returns TRUE or operation AE04
returns a counter larger than 0.

The computation can be sped up in the
following way:

1. Initialize a table S of size |T|, where S[i]
   holds information on the current status of
   concept i, initialized at "pending".

2. Starting from the uppermost levels, and
   continuing down in the tree, process concept i.

   2.1. If S[i] is "empty", i does not belong to
        RT, and processing can continue with the
        next concept.

   2.2. If S[i] is not "empty", apply operation
        AE03 or AE04 to i.

        2.2.1. If the operation returns TRUE (AE03)
               or a counter larger than 0 (AE04), i
               belongs to RT.

        2.2.2. Otherwise, neither i nor any of its
               descendants belong to RT: set to
               "empty" all S[j] in S, such that j is
               a descendant of i in the taxonomy.
               Descendants can be efficiently
               obtained by keeping a precomputed
               table D, holding for each concept in
               the taxonomy a list of all the
               concepts descending from it in the
               taxonomy (such a table must be
               recomputed every time the taxonomy
               changes).
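
The pruned single-step computation can be sketched as below. It assumes that the concepts are supplied in a top-down order (every father before its sons) and that the table D is precomputed once; `descendants_table`, `reduced_taxonomy` and the sample taxonomy are hypothetical names and data.

```python
def descendants_table(sons):
    """Precompute table D: concept -> set of all concepts descending from it."""
    table = {}

    def visit(c):
        if c not in table:
            table[c] = set()
            for s in sons.get(c, []):
                table[c] |= {s} | visit(s)
        return table[c]

    for c in list(sons):
        visit(c)
    return table


def reduced_taxonomy(dids, order, deep, desc):
    """Return the concepts of RT, visiting 'order' from the top levels down."""
    status = {c: "pending" for c in order}       # table S
    rt = []
    for c in order:
        if status[c] == "empty":                 # step 2.1: pruned earlier
            continue
        if dids & deep.get(c, set()):            # step 2.2: AE03
            rt.append(c)
        else:                                    # step 2.2.2: prune subtree
            for j in desc.get(c, ()):
                status[j] = "empty"
    return rt


sons = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
deep = {"root": {1, 2, 3}, "A": {1, 2}, "A1": {1}, "A2": {2},
        "B": {3}, "B1": {3}}
order = ["root", "A", "B", "A1", "A2", "B1"]
desc = descendants_table(sons)

print(reduced_taxonomy({1}, order, deep, desc))
```

In the example, B fails the membership test, so B1 is marked "empty" and never tested.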

AE08. Boolean combination of concepts.

Boolean combinations of concepts are performed
through the corresponding set operations on the
deep extension of concepts. Let c and c' be two
concepts, and DE(c) and DE(c') their deep extensions
(represented by AES1):

c AND c' corresponds to DE(c) ∩ DE(c')

c OR c' corresponds to DE(c) ∪ DE(c')

c MINUS c' corresponds to DE(c) - DE(c')

NOT c corresponds to U - DE(c), where U is the
universe
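
With deep extensions held as Python sets, these four combinations map directly onto built-in set operators. The sketch below uses invented concept names and document ids.

```python
universe = {1, 2, 3, 4, 5}                      # U
de = {"c": {1, 2, 3}, "c2": {2, 3, 4}}          # deep extensions DE(c), DE(c')

c_and = de["c"] & de["c2"]      # c AND c'
c_or = de["c"] | de["c2"]       # c OR c'
c_minus = de["c"] - de["c2"]    # c MINUS c'
c_not = universe - de["c"]      # NOT c

print(c_and, c_or, c_minus, c_not)
```

The result of any combination is itself a set of DIDs, so it can be fed to the projection or reduced-taxonomy operations without conversion.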

AE09. Insertion of a new document.

The insertion of a new document d (represented by
DID(d)) classified under a set of concepts {C}
requires the following steps:

for each c ∈ {C}

1. insert DID(d) in the shallow extension of c
   (AES2), if c is not a terminal concept and the
   shallow extension must be stored

2. insert DID(d) in the deep extension (AES1) of
   all the concepts in C^up(c)

3. insert an item [DID(d)] -> {C} in the
   classification structure AES3

AE010. Deletion of an existing document.

The deletion of a document d (represented by
DID(d)) requires the following steps:

1. retrieve the set of concepts {C} under which d
   is shallowly classified, by accessing AES3 with
   DID(d) as the key (operation AE012)

2. for each c ∈ {C}

   a. delete DID(d) from the shallow extension of c

   b. for all c' ∈ C^up(c): delete DID(d) from the
      deep extension of c'

3. delete the entry corresponding to DID(d) from
   AES3.

If AES3 is not stored, deletion is performed in
the following way. For each concept c in T, if d
belongs to the shallow extension of c:

1. delete DID(d) from the shallow extension of c

2. for all c' ∈ C^up(c): delete DID(d) from the deep
   extension of c'

AE011. Document reclassification.

Changes in the classification of a document d
(represented by DID(d)) are implemented in the
following way. Let d be initially classified under
a concept c (possibly null) and let the new concept
under which d must be classified be c' (possibly
null). If both c and c' are non-null, the operation
means that d was previously classified under c and
must now be classified under c'; if c is null, the
operation means that d is additionally classified
under c'; if c' is null, the operation means that
the original classification under c must be
removed. At least one of c and c' must be non-null.

If c is not null:

1. eliminate DID(d) from the shallow extension
   (AES2) of c

2. eliminate DID(d) from the deep extension (AES1)
   of all c'' ∈ C^up(c)

3. eliminate c from the classification of d (AES3)

If c' is not null:

1. insert DID(d) in the shallow extension (AES2) of
   c' (if the shallow extension of c' exists)

2. insert DID(d) in the deep extension (AES1) of
   all c'' ∈ C^up(c')

3. insert c' in the classification of d (AES3)
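
A hedged Python sketch of reclassification follows. The `tax` dictionary, the helper `c_up` and the concept names are hypothetical. One step is an addition that is not in the listed steps: after removing d from the ancestors of the old concept, d is restored in the ancestors of any concept under which it remains classified, so that shared ancestors do not lose the document.

```python
def c_up(c, fathers, root):
    """{c} plus all ancestors of c except the root."""
    result = {c}
    for f in fathers.get(c, ()):
        if f != root:
            result |= c_up(f, fathers, root)
    return result


def reclassify(d, old, new, tax):
    """AE011: move d from concept old to concept new (either may be None)."""
    if old is None and new is None:
        raise ValueError("at least one of old and new must be non-null")
    fathers, root = tax["fathers"], tax["root"]
    classes = tax["classification"].setdefault(d, set())
    if old is not None:
        tax["shallow"].get(old, set()).discard(d)
        for c in c_up(old, fathers, root):
            tax["deep"].get(c, set()).discard(d)
        classes.discard(old)
        # Addition for this sketch (not in the listed steps): keep d in the
        # deep extension of ancestors it still reaches through other concepts.
        for kept in classes:
            for c in c_up(kept, fathers, root):
                tax["deep"].setdefault(c, set()).add(d)
    if new is not None:
        tax["shallow"].setdefault(new, set()).add(d)
        for c in c_up(new, fathers, root):
            tax["deep"].setdefault(c, set()).add(d)
        classes.add(new)


tax = {
    "root": "root",
    "fathers": {"A": ["root"], "A1": ["A"], "A2": ["A"]},
    "shallow": {"A": set(), "A1": {7}, "A2": set()},
    "deep": {"A": {7}, "A1": {7}, "A2": set()},
    "classification": {7: {"A1"}},
}
reclassify(7, "A1", "A2", tax)
```

Calling `reclassify(d, None, c, tax)` behaves like the per-concept step of AE09, and `reclassify(d, c, None, tax)` like the per-concept step of AE010.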
AE012. Find the concepts under which a document d 
is immediately classified. 

Retrieve {C} from AES3, using DID(d) as a key. 

Physical storage structures, architecture and
implementation of operations will now be described.

As regards the intension, storage structures
usually contribute a negligible overhead to
the overall storage cost, since a few thousand
concepts are usually adequate even for semantically
rich corpora. Storage for these structures may be
provided by any database management system or any
keyed access method. The second form of AIS3 (FS')
requires an ordered access, since SEQ is used to
order the sons of a specific concept. Because of
the low overhead, all the intensional storage
structures (with the possible exception of AIS1,
the dictionaries) may usually be kept in central
memory.

As regards the extension, the most critical
component is AES1 (the deep extension), for several
reasons. First, deep-extension semantics are the
natural semantics for boolean combinations of
concepts (see AE08). Second, the production of
reduced taxonomies requires a possibly large number
of projections (which are performed on the deep
extension), whose performance is critical for
visual operations.

It is critical that the deep extension of
concept c is explicitly stored, and not computed as
the union of the shallow extensions of all the
descendants of c.

Although any DBMS or keyed access method can
be used to provide storage for the deep extension,
the set of documents in the deep extension can be
more efficiently represented than by
straightforwardly mapping the abstract relation.

The use of fixed size bit vectors in the
present context will now be described. Information
data bases with a small-to-moderate number of
documents can effectively represent the deep
extension of a concept c by bit vectors, each of
size equal to |U'|, the maximum number of documents
in the universe. In the bit vector, bit i is set if
and only if the document d with DID(d)=i is in the
deep extension of c.

Set operations on the deep extension only
involve logical operations on bit vectors (AND, OR,
NOT, etc.). These operations take one or more bit
vectors and produce a result bit vector of the same
size.

Let document ids be numbered 0 to |U'|-1, and
n be the number of bits in the word of the host
CPU. For performance reasons, it is better to set
the fixed size of bit vectors at ⌈|U'|/n⌉·n bits, in
order to be able to perform bit operations at the
word level; unused bit positions are left unset.
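
A fixed-size, word-aligned bit vector can be sketched with Python's standard `array` module. The 64-bit word size, the class name `BitVector` and its methods are assumptions of this example, not part of the described process.

```python
from array import array

WORD_BITS = 64  # n: assumed word size of the host CPU


class BitVector:
    def __init__(self, universe_size, words=None):
        n_words = -(-universe_size // WORD_BITS)   # ceil(|U'| / n)
        self.universe_size = universe_size
        self.words = words if words is not None else array("Q", [0] * n_words)

    def set(self, did):
        self.words[did // WORD_BITS] |= 1 << (did % WORD_BITS)

    def reset(self, did):
        self.words[did // WORD_BITS] &= ~(1 << (did % WORD_BITS))

    def test(self, did):
        return bool((self.words[did // WORD_BITS] >> (did % WORD_BITS)) & 1)

    def __and__(self, other):
        # Word-level AND: one machine-word operation per n documents.
        merged = (a & b for a, b in zip(self.words, other.words))
        return BitVector(self.universe_size, array("Q", merged))

    def __or__(self, other):
        merged = (a | b for a, b in zip(self.words, other.words))
        return BitVector(self.universe_size, array("Q", merged))


a = BitVector(100)      # 100 documents -> 2 words of 64 bits
a.set(3)
a.set(65)
b = BitVector(100)
b.set(65)
b.set(99)
both = a & b
either = a | b
a.reset(3)
```

NOT is omitted from the sketch because it would also flip the unused padding bits, which then need to be masked off.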

Counting the number of documents in the result
of any operation can be efficiently performed by
table lookup, in the following way.

Let the unit of access UA (not necessarily the
CPU word) be n bits. Build once a vector V of 2^n
elements, stored in memory, which stores in V[i]
the number of bits set in the binary number i,
0 <= i <= 2^n - 1.

Counting:

Initialize counter C at 0;
for each chunk of n bits in the bit vector
    store the chunk in i
    set C = C + V[i]

For access at the octet level (n=8), the
translation table requires no more than 256 octets.
For access at the double octet level (n=16), no
more than 64K octets. Larger units of access are
not recommended.
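
The table-lookup count can be sketched in Python for the octet case (n=8). The bit vector is assumed to be available as a `bytes` object, and the function name `count_bits` is hypothetical.

```python
N = 8                                               # unit of access, in bits
V = [bin(i).count("1") for i in range(2 ** N)]      # built once: 256 entries


def count_bits(data: bytes) -> int:
    """Sum, chunk by chunk, the precomputed number of set bits."""
    c = 0
    for chunk in data:        # iterating over bytes yields ints in 0..255
        c += V[chunk]
    return c


print(count_bits(bytes([0b10110000, 0xFF, 0x00])))
```

For n=16 the same idea applies with a 65536-entry table and two-octet chunks.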

Insertion, deletion and reclassification are
also efficiently performed, by simply locating the
appropriate deep and/or shallow extension and
setting/resetting the appropriate bit.

This same representation can be trivially used
for storing structures AES2 and AES3. In AES3 the
size of the bit vector is equal to the cardinality
of the set of concepts in the taxonomy.

As regards compressed bit vectors, by
construction, the deep extension is very sparse at
terminal level, and very dense at the top levels in
the taxonomy. The use of any type of bit vector
compression (such as, without prejudice to
generality, Run Length Encoding (see Capon J., "A
probabilistic model for run-length coding of
pictures", IEEE Trans. on Inf. Theory, 1959) and/or
variable-length bit vectors) is therefore
beneficial in reducing the overall storage
overhead, although it introduces a
compression/decompression overhead.

If a controlled error-rate in operations is
acceptable, Bloom filters (see Bloom, B. H.,
Space/time tradeoffs in hash coding with allowable
errors, Comm. of the ACM, 1970) can be used to
represent the deep extension in a compact form,
suitable for larger information bases. With Bloom
filters, counting and set negation are usually not
supported.

For large to very large information bases, a
bit vector representation (albeit compressed) may
produce an excessive storage overhead. The deep and
shallow extensions as well as structure AES3 may be
stored as inverted lists (see Wiederhold, G., Files
structures, McGraw-Hill, 1987). For performance in
the computation of set operations, such lists (and
the result of set operations) are kept ordered by
document id's. For the same storage reasons, it is
generally advantageous to use any form of inverted
list compression.
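
Keeping inverted lists ordered by document id allows set operations to be computed by a single merge pass. The following is a minimal sketch of intersection under that assumption; the function name intersect_sorted is hypothetical.

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    """Intersect two DID lists that are both sorted in increasing order."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # DID present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the list with the smaller current DID
        else:
            j += 1
    return result  # still ordered by DID, ready for further operations
```

The result is itself ordered, so it can be fed to a further set operation without sorting.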

As regards the general architectural 
strategies, the implementation of dynamic 
taxonomies should try to keep all the relevant data 
structures in main memory, shared by the processes 
accessing them. 

As noted before, the intension overhead is
generally negligible so that intensional structures
(with the possible exception of dictionaries) may
usually be kept in memory without problems.

Extension overhead for extensional structures
is considerably larger. If the storage overhead
prevents the complete storage of deep-extension
structures, buffering strategies should be used,
such as LRU or the ones described in Johnson, T.,
Shasha, D.: 2Q: A Low Overhead High Performance
Buffer Management Replacement Algorithm, Int. Conf.
on Very Large Databases, 1994; and O'Neill, et al.:
The LRU-K Page Replacement Algorithm For Database
Disk Buffering, SIGMOD Conf. 1993. Shallow
extensions and classification structures are less
critical and may be kept on disk (again with the
buffering strategies described in the two
above-mentioned documents).

As indicated in operation AE03, the membership
test without counting can return TRUE when the
first DID common to both lists is found, thereby
speeding up the computation.
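
A minimal sketch of this early-exit membership test on two DID lists ordered by increasing document id follows; the function name has_common_did is hypothetical.

```python
def has_common_did(a: list[int], b: list[int]) -> bool:
    """Return True as soon as one DID common to both sorted lists is found."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True  # first common DID: no need to scan any further
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False  # one list was exhausted without finding a common DID
```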

The use and implementation of virtual concepts
will now be described.

Some data domains (such as price, dates,
quantities, etc.) usually correspond to a concept
(e.g. PRICE) which can be expanded into a large
number of terminal concepts, each representing a
specific value (e.g. 100 $). Such a representation
causes a high number of son concepts, and increases
the complexity of the taxonomy. Alternatively,



PCT/IT99/00401 








values can be grouped by defining meaningful
intervals of values and representing only the
intervals as specific concepts. This representation
loses the actual data, and presents the user with a
fixed classification. Grouping may also be combined
with exhaustive representation, but inherits most
of the problems of both schemes.

The invention of "virtual concepts" provides a
third, more flexible alternative. We define a
"simple virtual concept" as a concept for which
neither the actual sons (actual values of the
domain to be represented) nor the actual extension
are stored, but are computed (usually from
additional, possibly external data).

A virtual concept is completely described by 4
abstract operations:

V1: Given a virtual concept v, retrieve all its
sons.

V2: Given a virtual concept v, retrieve its deep
extension.

V3: Given the son s of a virtual concept v,
retrieve its deep extension.

V4: Given a document d, find all the terminal
concepts (descendants of v) under which it is
stored.

One way of implementing these abstract
operations is by keeping, for each virtual concept
v, two abstract relations:

Sv: [value] -> {DID}

which stores the set of documents with a given
value in the domain of values of the virtual
concept.

Cv: [DID] -> {value}

which stores the set of values for a specific
document; if each document has a single value,
Cv: [DID] -> [value]. A single Cv relation may store
multiple domains and be shared by many virtual
concepts: in this case Cv: [DID] -> {valueA, ...,
valueN}, where valueI denotes the set of values for
domain I. It is important to note that neither Sv
nor Cv need to be explicitly stored, but they can
also be synthesized by queries on external data.

These two abstract relations can be
represented by a single relation in a relational
schema (without loss of generality and simply to
provide a clear description of operations)

Cv (_DID_, _value_)

with underscored attributes representing the
primary keys. Sv actually stores the inversion of Cv
and will usually be represented by a secondary
index on Cv, rather than by a base relation.

With this representation, the abstract
operations defined before can be easily implemented
by SQL queries:

V1: Given a virtual concept v, retrieve all its
sons:

    SELECT DISTINCT value
    FROM Cv

V2: Given a virtual concept v, retrieve its deep
extension:

    SELECT DISTINCT DID
    FROM Cv

V3: Given the son s of a virtual concept v,
retrieve its extension (s is a terminal concept, so
that its deep and shallow extension are the same):

    SELECT DISTINCT DID
    FROM Cv
    WHERE value=s

Counting is trivially added.

V4: Given a document d, find all the terminal
concepts (descendants of v) under which it is
stored:

    SELECT DISTINCT value
    FROM Cv
    WHERE DID=d
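
The four operations can be tried on a small in-memory relation. The sketch below assumes the single-relation representation Cv(DID, value), uses Python's standard sqlite3 module, and uses made-up sample rows; the function names and the data are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cv(DID, value): both attributes together form the primary key.
conn.execute("CREATE TABLE Cv (DID INTEGER, value INTEGER, PRIMARY KEY (DID, value))")
# Sv (the inversion of Cv) is realized as a secondary index on value.
conn.execute("CREATE INDEX Sv ON Cv (value)")
conn.executemany("INSERT INTO Cv VALUES (?, ?)", [(1, 100), (2, 250), (3, 100), (3, 900)])


def v1_sons(conn):
    """V1: all sons (actual values) of the virtual concept."""
    return [r[0] for r in conn.execute("SELECT DISTINCT value FROM Cv ORDER BY value")]


def v2_deep_extension(conn):
    """V2: deep extension of the virtual concept."""
    return [r[0] for r in conn.execute("SELECT DISTINCT DID FROM Cv ORDER BY DID")]


def v3_son_extension(conn, s):
    """V3: extension of the son s (a terminal concept)."""
    rows = conn.execute("SELECT DISTINCT DID FROM Cv WHERE value = ? ORDER BY DID", (s,))
    return [r[0] for r in rows]


def v4_concepts_of(conn, d):
    """V4: terminal concepts under which document d is stored."""
    rows = conn.execute("SELECT DISTINCT value FROM Cv WHERE DID = ? ORDER BY value", (d,))
    return [r[0] for r in rows]
```

The ORDER BY clauses are added here only to make the results deterministic; they are not part of the operations as described.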

In general, a virtual concept v can be
organized into a sub-taxonomy, i.e. each
non-terminal son of v represents a set of actual
domain values. Each son may be further specialized,
and so on. For instance SALARY can be organized
into the following taxonomy:

SALARY
    Low (e.g. <1000)
    Medium (e.g. >=1000 and <10000)
    High (e.g. >=10000)

In this case, the non-terminal descendants of
v can be stored as derived virtual concepts, i.e.
virtual concepts referencing the same abstract
relations defined for v, but providing additional
restrictions. In the example, "Low" can be
characterized by the additional restriction
value<1000, so that operation V3 for Low becomes:

    SELECT DISTINCT DID
    FROM Cv
    WHERE value<1000
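
A derived virtual concept can thus be sketched as a reference to the base virtual concept plus a restriction that is appended to the query. In the sketch below the restriction table, the sample salaries and the function name are hypothetical, and the bound for High is taken as >=10000 so that the three intervals cover the whole domain. The restrictions are assumed to be trusted strings written by the taxonomy designer, never user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cv (DID INTEGER, value INTEGER, PRIMARY KEY (DID, value))")
conn.executemany(
    "INSERT INTO Cv VALUES (?, ?)",
    [(1, 500), (2, 1000), (3, 9999), (4, 10000), (5, 25000)],
)

# Derived virtual concept -> (virtual concept referred to, additional restriction).
DERIVED = {
    "Low": ("SALARY", "value < 1000"),
    "Medium": ("SALARY", "value >= 1000 AND value < 10000"),
    "High": ("SALARY", "value >= 10000"),
}


def v3_derived(concept: str) -> list[int]:
    """Operation V3 for a derived virtual concept: base query plus its restriction."""
    _base, restriction = DERIVED[concept]
    # The restriction comes from the designer-defined table above, not from users.
    sql = f"SELECT DISTINCT DID FROM Cv WHERE {restriction} ORDER BY DID"
    return [row[0] for row in conn.execute(sql)]
```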

Virtual and derived virtual concepts are
peculiar in that their terminal descendants and
their extensions are not directly stored but
computed. In order to represent them in our
framework, the following abstract relations are
added to the intension:

AIS5: [CID] -> [conceptType]
where conceptType designates real, simple virtual
and derived virtual concepts.

AIS6: [CID] -> [S_CID]
for simple virtual concepts, stores the abstract
relation Sv (which can be synthesized by a query)
for the virtual concept CID.

AIS7: [CID] -> [C_CID]
for simple virtual concepts, stores the abstract
relation Cv (which can be synthesized by a query)
for the virtual concept CID.

AIS8: [CID] -> [CID', restriction]
for derived virtual concepts only, identifies the
virtual concept to refer to and the additional
restriction.

The use and implementation of time-varying
concepts will now be described.

Time-varying concepts, such as age, can be
represented by a simple variant of virtual
concepts. A time instant t is represented as an
abstract "timestamp". The timestamp contains the
number of clock ticks starting from a fixed time
origin; the clock resolution depends on the
application. All timestamps use the same time
coordinates. The difference between two timestamps
t and t' defines the time interval amplitude
between the two times. Let the values of the
virtual concept v be the set of timestamps of all
documents in the extension of v, let T be the
timestamp of the current time, and let the sons of
v be represented as time intervals with respect to
the current timestamp T:

Given a virtual concept v, retrieve all its sons:

    SELECT DISTINCT T-value
    FROM Cv

Given a virtual concept v, retrieve its deep
extension:

    SELECT DISTINCT DID
    FROM Cv

Given the son s of a virtual concept v, retrieve
its extension:

    SELECT DISTINCT DID
    FROM Cv
    WHERE value=T-s

Alternatively, and more efficiently, the
values of the time-varying concept can be split
into N intervals (from more recent to older), which
are stored as real concepts. In addition, for each
interval I, we keep:

a. the list L(I) of DIDs in the interval ordered by
   decreasing timestamps (i.e. newer to older)
b. in central memory, an interval representative
   IR(I): the last DID in the interval together
   with its timestamp
c. a classification criterion (e.g. T-value less
   than 1 week and no smaller than 1 day)

Since the classification of documents varies
with time, we need to recompute the classification
of documents every time tick (arbitrary time
interval selected by the system administrator,
typically a multiple of clock resolution),
according to the following algorithm:

At each time tick:
For each interval I
    while IR(I) needs reclassification (i.e. it fails
    the classification criterion for I) do
    {
        Reclassify(IR(I));
        set as IR(I) the last DID in the ordered list
        L(I)
    }

where Reclassify(IR(I)) is

    Delete IR(I).DID from I
    For (i=I+1 to N)
    {
        if IR(I).timestamp meets the classification
        criterion for interval i
        {
            insert IR(I) in interval i
            break;
        }
    }
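
The reclassification algorithm above can be sketched as follows. This is a sketch under stated assumptions: each classification criterion is modeled as an age range min_age <= T - timestamp < max_age, timestamps are plain integers, and the names Interval, accepts and tick are hypothetical. Lists are stored as (-timestamp, DID) pairs so that ascending order corresponds to newer-to-older, and the last element plays the role of IR(I).

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Interval:
    min_age: int  # inclusive lower bound on T - timestamp
    max_age: int  # exclusive upper bound on T - timestamp
    docs: list = field(default_factory=list)  # L(I): (-timestamp, DID), newest first

    def accepts(self, timestamp: int, now: int) -> bool:
        """Classification criterion for this interval."""
        return self.min_age <= now - timestamp < self.max_age


def tick(intervals: list, now: int) -> None:
    """Reclassify documents at a time tick; intervals go from more recent to older."""
    for idx, interval in enumerate(intervals):
        while interval.docs:
            neg_ts, did = interval.docs[-1]  # IR(I): the oldest document in I
            timestamp = -neg_ts
            if interval.accepts(timestamp, now):
                break  # everything else in I is newer, so it still belongs to I
            interval.docs.pop()  # delete IR(I).DID from I
            for older in intervals[idx + 1:]:
                if older.accepts(timestamp, now):
                    bisect.insort(older.docs, (neg_ts, did))  # keep L(i) ordered
                    break
```

As in the algorithm as described, a document that meets the criterion of no older interval is simply removed. Only the representative of each interval is inspected, so a tick in which nothing ages out costs one test per interval.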



Binding a dynamic taxonomy to a database
system will now be described.

The present invention allows a dynamic
taxonomy to be used to browse and retrieve data
stored in a conventional DBMS (relational,
object-relational, object-oriented, etc.). The
invention covers data stored as a single relation
(or object) or, more generally, represented by a
single view on the database (see Elmasri, Navathe,
Fundamentals of database systems, The
Benjamin/Cummings Publ. Co., 1994).

In this case documents correspond to tuples
(or rows, records, objects) in the view V. In order
to identify a document we can either use the
primary key of the view as a document identifier
(DID) or keep two abstract relations mapping
system-generated DID's to and from the primary key
PK of the view:

DK: [DID] -> [PK]
IDK: [PK] -> [DID]

where PK represents the primary key of the
relation. DK is used to access a tuple of V, given
a document id DID, and IDK is used to retrieve the
document id corresponding to a specific value in
the primary key of V. This latter representation is
beneficial when primary keys are large (e.g.
when they are defined on alphanumeric attributes).

Given a view V we can construct a taxonomy T
for V in the following way. For each attribute A in
V, we place a corresponding concept C(A) (either a
real or a virtual one) as an immediate son of the
root. Virtual concepts use V itself for the
synthesis of sons and extensions (as previously
seen). Real concepts can be further specialized as
required by the semantics of A.

Given a tuple t in V, for each attribute A in
V, let t.A denote the value of attribute A in t.
For each real concept C in T (either C(A) or a
descendant of C(A)), the designer must provide a
boolean clause B(C, t) such that t (represented by
DID(t)) is to be classified under C if and only if
B(C, t)=TRUE.

The boolean clause B(C, t) may reference any
attribute of t, and consequently, new virtual
concepts (called "extended concepts") may be
defined on combinations of attributes by operations
on the database (including but not restricted to
sums, averages, etc. of database values).

A special case occurs when the boolean clause
B(C, t) is true when t.A ∈ S_C, where S_C is a set
of values of attribute A and S_C ∩ S_C' = ∅ for all
C ≠ C'. In this case, it is more efficient to keep
a table T: [v] -> [c], listing for each value v in
domain(A) the corresponding concept c. If
S_C ∩ S_C' ≠ ∅ for some C ≠ C', multiple concepts
can be associated with the same value, so that
T: [v] -> {c}.
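
The value-to-concept table can be sketched by inverting the sets S_C. In the sketch below, the function name build_value_table and the sample value sets (including the concept G7) are hypothetical; the general form T: [v] -> {c} is used, which reduces to a single concept per value when the sets are disjoint.

```python
def build_value_table(value_sets: dict[str, set]) -> dict[object, set[str]]:
    """Invert concept -> S_C into the table T: value -> set of concepts."""
    table: dict[object, set[str]] = {}
    for concept, values in value_sets.items():
        for v in values:
            table.setdefault(v, set()).add(concept)
    return table


# Disjoint sets: every value maps to exactly one concept.
disjoint = build_value_table({"EUROPE": {"Italy", "France"}, "AMERICA": {"USA"}})

# Overlapping sets: a value may map to several concepts.
overlapping = build_value_table({"EUROPE": {"Italy"}, "G7": {"Italy", "USA"}})
```

Classifying a tuple t then needs one lookup on t.A instead of the evaluation of every boolean clause.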

In addition to this mapping among attributes and
concepts, the designer may define new concepts
either as taxonomic generalizations of attributes
or extended concepts.

• New taxonomic generalizations. For virtual
concepts, this feature was discussed previously.
If the sons of a new taxonomic generalization G
are real concepts (S), no boolean clause is



WO 00/36529 







usually required for G, because classification
under G is automatically performed by operation
AE09.

• Extended concepts. New concepts may be derived
either as real or virtual concepts by operations
on the database (including but not restricted to
sums, averages, etc. of database values).

Binding is then performed in the following way.
Virtual concepts do not require any special
processing, since they are realized by operations
on the database. Real concepts require a
classification for any new tuple, a deletion if t
is deleted or a reclassification if t is changed.
In order to classify t, the system locates the set
C of concepts for which B(c, t), c ∈ C, is satisfied
and classifies t under every c ∈ C (and,
consequently, under all of c's ancestors). Deletion
and reclassification are performed as previously
stated.
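
Classification of a tuple under real concepts can be sketched as evaluating each boolean clause and then adding all ancestors. The clause table, the son-to-father map and the sample tuple below are hypothetical illustrations, not part of the described process.

```python
from typing import Callable

Row = dict[str, object]

# Boolean clauses B(C, t) for the real concepts that have one.
CLAUSES: dict[str, Callable[[Row], bool]] = {
    "Italy": lambda t: t["COUNTRY"] == "Italy",
    "France": lambda t: t["COUNTRY"] == "France",
    "Usa": lambda t: t["COUNTRY"] == "USA",
}

# Son-to-father relation for the concepts involved.
FATHERS: dict[str, list[str]] = {
    "Italy": ["EUROPE"],
    "France": ["EUROPE"],
    "Usa": ["AMERICA"],
    "EUROPE": ["CONTINENT"],
    "AMERICA": ["CONTINENT"],
    "CONTINENT": [],
}


def classify(t: Row) -> tuple[set[str], set[str]]:
    """Return (concepts whose clause holds, those concepts plus all ancestors)."""
    direct = {c for c, clause in CLAUSES.items() if clause(t)}
    reached = set(direct)
    stack = list(direct)
    while stack:
        c = stack.pop()
        for father in FATHERS.get(c, []):
            if father not in reached:
                reached.add(father)
                stack.append(father)
    return direct, reached
```

The first set corresponds to the shallow classification of t, the second to the concepts whose deep extension contains t.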
Example:

Given the relation R: (TOWNID, NAME, COUNTRY,
POPULATION), we can identify the documents in the
database by the values of TOWNID. We need to decide
which attributes will be represented in T and how
they will be represented. Let COUNTRY be
represented by a real concept, and NAME be
represented by a virtual concept. In addition we
define the real concept CONTINENT as the continent
the COUNTRY is in. CONTINENT can be represented in
two ways: as a taxonomic generalization concept or
as an extended concept.

If we represent CONTINENT as an extended
concept, the taxonomy T will be:

NAME
    Sv: Select TOWNID FROM R WHERE NAME = x
    Cv: Select DISTINCT NAME FROM R
CONTINENT
    EUROPE     t.COUNTRY="Italy" or t.COUNTRY="France" or ...
    AMERICA    t.COUNTRY="USA" or ...
    ASIA       t.COUNTRY=...
COUNTRY
    Italy      t.COUNTRY="Italy"
    France     t.COUNTRY="France"
    Usa        t.COUNTRY="USA"

If we represent CONTINENT as a taxonomic
generalization of COUNTRY, the taxonomy T' will be:

NAME
    Sv: Select TOWNID FROM R WHERE NAME = x
    Cv: Select DISTINCT NAME FROM R
CONTINENT
    EUROPE
        Italy      t.COUNTRY="Italy"
        France     t.COUNTRY="France"
    AMERICA
        Usa        t.COUNTRY="USA"
    ASIA
COUNTRY
    Italy      t.COUNTRY="Italy"
    France     t.COUNTRY="France"
    Usa        t.COUNTRY="USA"

In both cases, NAME is represented in the same
way. For NAME, we have two abstract relations:

Sv: [NAME] -> {TOWNID}
Cv: [TOWNID] -> [NAME]

POPULATION is represented in an analogous way.

Finally, the use of dynamic taxonomies to
represent user profiles of interest and the
implementation of a user alert for new interesting
documents based on dynamic taxonomy profiles, will
be described.

The invention consists in using set-theoretic
expressions on concepts (plus optional, additional
expressions, such as information retrieval queries)
to describe user interest in specific topics. Such
expressions may be directly entered by the user or
transparently and automatically captured by the
system, by monitoring user query/browsing. The
specification of user profiles is especially
important in electronic commerce and information
brokering and in monitoring dynamic data sources in
order to advise users of new or changed relevant
information. The information base is assumed to be
classified through dynamic taxonomies.

The scenario is as follows. Several users
express their interests through possibly multiple
conceptual expressions, called "interest
specifications". A monitoring system accepts these
requests (with an abstract user "address" to send
alerts to). The monitoring system also monitors an
information base for changes (insertion, deletion,
change). The information base is described by the
same taxonomy used by users to express their
interests.

When a change occurs in the information base
(the type of change to be alerted for may be
specified by users), the system must find the users
to alert on the basis of their interests.

A brute force approach will check all user
interest specifications exhaustively, and compute
whether each changed document d satisfies any given
specification S. We can test whether a document d
satisfies a specification S by applying the query
specified in S to the singleton set {d} and testing
whether d is retrieved. However, this strategy
requires performing, for each information base
change, as many queries as there are user
specifications and may be quite expensive in
practice. For this reason, we define alternate
strategies which reduce the number of evaluations
required.

We are primarily interested in the efficient
solution of dynamic taxonomy specifications.
Additional expressions, such as information
retrieval queries, will usually be composed by AND
with taxonomic expressions, and can therefore be
solved, if required, after the corresponding
taxonomic expression is satisfied.

We will start from the simplest case, in which:

a) the specification is expressed as a conjunction
of terminal concepts;
b) documents are classified under terminal concepts
only.

As regards conjunctive specifications and
document classification under terminal concepts
only, we use two abstract storage structures:

1. a directory of specifications, in the form:
SD: [SID] -> [N, SPEC]
where SID is an abstract identifier which uniquely
identifies the specification, SPEC is the
specification itself (optional), N is the number of
concepts referenced in the specification.
Optionally, other fields (such as the user
"address") will be stored in this structure.

2. a specification "inversion", in the form:
SI: [CID] -> {SID}
listing for each concept c (represented by its
concept identifier) all the specifications
(represented by their specification id) using that
concept.

When a specification is created, its abstract
identifier is created, its directory entry is
created in SD and the set of concepts referenced in
the specification is stored in the inversion SI.

When a document d is inserted, deleted or
changed, let C be the set of concepts (terminal
concepts by assumption) under which d is
classified. The set of specifications that apply to
d is then found in the following way.

Let K be the set of concepts used to classify
document d. For each concept k in K, let SID(k) be
the list of specifications for k (accessible
through relation SI) ordered by increasing
specification id's. We define MergeCount(K) as the
set composed of pairs (SID, N) such that SID is in
MergeCount(K) if SID belongs to a SID(k), k in K.
If the pair (SID, N) is in MergeCount(K), N counts
the number of SID(k) referencing SID. MergeCount(K)
can be produced at a linear cost, by merging the
SID(k) lists.

Let S be a set initially empty, which
represents the set of specifications satisfied by
d.

For each pair (SID, N)
    retrieve SID.N from SD;
    if SID.N=N: S=S union SID
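
The MergeCount computation and the final test can be sketched as follows. The sample directory SD and inversion SI are made up for illustration (SPEC and other optional fields are omitted), and the function names are hypothetical. The sketch merges the ordered lists with the standard heapq.merge, which costs O(n log k) for k lists rather than being strictly linear.

```python
import heapq
from itertools import groupby

# SD: SID -> N, the number of concepts referenced in the specification.
SD = {1: 2, 2: 1, 3: 3}
# SI: concept id -> SIDs using that concept, ordered by increasing SID.
# Here: spec 1 = a AND b, spec 2 = b, spec 3 = a AND c AND d.
SI = {"a": [1, 3], "b": [1, 2], "c": [3], "d": [3]}


def merge_count(K: set) -> list:
    """Pairs (SID, N): N is the number of lists SID(k), k in K, containing SID."""
    merged = heapq.merge(*(SI.get(k, []) for k in K))  # one ordered stream of SIDs
    return [(sid, sum(1 for _ in group)) for sid, group in groupby(merged)]


def satisfied_specifications(K: set) -> set:
    """Specifications all of whose concepts are among the concepts of the document."""
    return {sid for sid, n in merge_count(K) if SD[sid] == n}
```

For a document classified under concepts a and b, specification 1 is counted twice and specification 2 once, so both match their N, while specification 3 is counted once out of 3 and is rejected.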
As regards specifications using unrestricted
set operations, let S (represented by SID(S)) be a
specification. Transform S into a disjunctive
normal form (i.e. as a disjunction of
conjunctions). Let each conjunctive clause in S be
called a component of S. We denote by SIDi(S) the
i-th component of S.

Store the directory of specifications as two
abstract relations:

SD, as before, with N omitted;

SCD: [COMPONENT] -> [SID, N], where COMPONENT
stores components of specifications, COMPONENT.SID
represents the specification id of the
specification S of which COMPONENT is a component,
and COMPONENT.N is the number of concepts
referenced in the component.

The specification inversion is stored as:

SI: [CID] -> {COMPONENT}, where CID is a concept
identifier and CID.COMPONENT is the set of
components referencing the concept identified by
CID.


Let K be the set of concepts used to classify
document d. For each concept k in K, let
COMPONENT(k) be the list of components for k
(accessible through relation SI) ordered by
increasing component id's. Define
ComponentMergeCount(K) as the set composed of pairs
(COMPONENT, N) such that COMPONENT is in
ComponentMergeCount(K) if COMPONENT belongs to a
COMPONENT(k), k in K. If the pair (COMPONENT, N) is
in ComponentMergeCount(K), N counts the number of
COMPONENT(k) referencing COMPONENT.
ComponentMergeCount(K) can be produced at a linear
cost, by merging the COMPONENT(k) lists.

Let S be a set initially empty.

For each pair (COMPONENT, N),
    retrieve COMPONENT.N through relation SCD;
    if COMPONENT.N=N: S=S union COMPONENT.SID
    (COMPONENT.SID is accessed through relation SCD).

S represents the set of specifications satisfied by
d.
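
The component-based variant can be sketched in the same way. The sample relations SCD and SI below are made up for illustration, the function names are hypothetical, and heapq.merge (O(n log k) for k lists) stands in for the merge step.

```python
import heapq
from itertools import groupby

# SCD: component id -> (SID of the owning specification, N concepts in the component).
# Here: spec 10 = (a AND b) OR c -> components 0 and 1; spec 20 = b AND c -> component 2.
SCD = {0: (10, 2), 1: (10, 1), 2: (20, 2)}
# SI: concept id -> component ids referencing it, ordered by increasing component id.
SI = {"a": [0], "b": [0, 2], "c": [1, 2]}


def component_merge_count(K: set) -> list:
    """Pairs (COMPONENT, N): N is the number of lists COMPONENT(k) containing it."""
    merged = heapq.merge(*(SI.get(k, []) for k in K))
    return [(component, sum(1 for _ in group)) for component, group in groupby(merged)]


def satisfied_specifications(K: set) -> set:
    """A specification is satisfied as soon as one of its components is."""
    S = set()
    for component, n in component_merge_count(K):
        sid, required = SCD[component]
        if required == n:
            S.add(sid)  # adding to a set removes duplicate specifications
    return S
```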

As regards specifications and document
classification under non-terminal concepts to which
they refer, the specification inversion SI needs to
be modified in the following way.

If a specification or component Z references
concept C, represented by CID(C), then:

if C is a terminal concept:
    CID(C).SID = CID(C).SID union Z, if Z is a
    specification
    CID(C).COMPONENT = CID(C).COMPONENT union
    Z, if Z is a component
if C is a non-terminal concept:
    for each k in C_down(C)
        CID(k).SID = CID(k).SID union Z, if Z is a
        specification
        CID(k).COMPONENT = CID(k).COMPONENT union
        Z, if Z is a component

The set S of satisfied specifications is
computed as per the previous cases.
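
The modified maintenance of the inversion SI can be sketched as follows. The sketch assumes that C_down(C) denotes C together with all of its descendants, which for a terminal concept is C alone, so one routine covers both cases. The father-to-son map and the function names are hypothetical.

```python
import bisect

# Father-to-son relation for a small sample taxonomy.
SONS = {
    "CONTINENT": ["EUROPE", "AMERICA"],
    "EUROPE": ["Italy", "France"],
    "AMERICA": ["Usa"],
}


def c_down(c: str) -> set:
    """C plus all its descendants (assumed meaning of C_down(C))."""
    reached = set()
    stack = [c]
    while stack:
        k = stack.pop()
        if k not in reached:
            reached.add(k)
            stack.extend(SONS.get(k, []))
    return reached


def register(SI: dict, z: int, referenced: list) -> None:
    """Record specification (or component) id z under every concept it can match."""
    for c in referenced:
        for k in c_down(c):
            ids = SI.setdefault(k, [])
            if z not in ids:
                bisect.insort(ids, z)  # keep each list ordered by increasing id
```

Expanding at registration time keeps the matching step unchanged: a document classified under Italy finds a specification on EUROPE directly in the list for Italy.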

The above-disclosed techniques allow computing
the specifications satisfied by a document d. In
case it is desired to determine the specifications
satisfied by a set of documents D (whose
cardinality is greater than 1), the above-disclosed
techniques can be applied in two ways. In the first
way, the techniques are applied without
modifications to every document d in D, then
removing possible duplicate specifications. In the
second way, K is defined as the set of concepts
used to classify D, the adequate technique is
chosen among the described ones and the set S of
"candidate" specifications is determined. Every
specification s in S is then checked, performing it
on D.



CLAIMS 

1. Process for retrieving information on large
heterogeneous data bases, characterized in that
information retrieval through visual
queries/searches is supported by dynamic
taxonomies, and further characterized in that said
process comprises the steps of:

- (F1) initially showing a complete taxonomy for
said retrieval;

- (F2) refining said retrieval through a selection
of subsets of interest, said refining step being
performed by selecting taxonomy concepts and
combining them through boolean operations;

- (F3) showing a reduced taxonomy for said selected
set; and

- (F4) further refining said retrieval through an
iterative execution of said refining and showing
steps.

2. Process according to Claim 1, characterized in
that it comprises the following aspects of dynamic
taxonomies:

a) additional operations;

b) abstract storage structures and operations for
the intension and the extension;

c) physical storage structures, architecture and
implementation of operations;

d) definition, use and implementation of virtual
concepts;

e) definition, use and implementation of
time-varying concepts;

f) binding a dynamic taxonomy to a database system;
and

g) using dynamic taxonomies to represent user
profiles of interest and implementation of a
user alert for new interesting documents based
on dynamic taxonomy profiles.

3. Process according to Claim 2, characterized in
that it further comprises the following operations:

a) projection under a given CID of a set S of DIDs:
extraction of all children c of CID such that
there is at least a document in S in the deep
extension of c; and

b) extraction of CID's for a specific document d in
U.

4. Process according to Claim 2, characterized in
that the intension is stored as one or more
"dictionary" relations (AIS1) in the form
Di: [CID] -> [textualLabel], said relations storing
the user-visible definition of each concept; the
domain "textualLabel" holding natural language
descriptions of concepts.

5. Process according to Claim 2, characterized in
that the intension comprises a language directory
(AIS2), said language directory identifying the
appropriate dictionary relation for a specific
"language" in the form:
LD: [LANGUAGE_ID] -> D
where LANGUAGE_ID holds the abstract identification
of languages and D holds the existing dictionaries.

6. Process according to Claim 4 or 5, characterized
in that an alternate representation of AIS1, AIS2
is by a single relation:
AIS1': [CID, LANGUAGE_ID] -> textualLabel.
7. Process according to Claim 2, characterized in
that the intension comprises an abstract father to
son relation (AIS3) in the form:
FS: [CID] -> {SON_CID}
or
FS': [CID, SEQ] -> {SON_CID}
storing, for each concept c, its sons in the
taxonomy, said domain SON_CID being the same as
CID, said domain of SEQ being the set of natural
numbers.

8. Process according to Claim 2, characterized in
that the intension further comprises an abstract
son to father relation (AIS4), in the form:
SF: [CID] -> {FATHER_CID}
storing, for each concept c, its fathers in the
taxonomy, said domain FATHER_CID being the same as
CID.

9. Process according to Claim 3, characterized in
that, if the taxonomy is not a lattice (i.e. any
concept c can have no more than one father), said
abstract son to father relation (AIS4) becomes:
SF: [CID] -> [FATHER_CID],
information on the father of a specific concept c
being able to alternatively be stored in the
dictionaries as:
Di: [CID] -> [FATHER_CID, textualLabel].

10. Process according to Claim 2, characterized in
that, on the intension, the retrieval operation
(AI01) of the label of a concept c, identified by
K=CID(c), in a specific language L, is abstractly
implemented, said retrieval operation (AI01)
comprising the steps of:

- accessing the appropriate language
directory; and

- using K as a key to access the textual
label.

11. Process according to Claim 2, characterized in
that, on the intension, the retrieval operation
(AI02) of all the sons of a given K=CID(c) is
abstractly implemented, by accessing the
father-to-son relation FS, using K as a partial
key, or by accessing the father-to-son relation
FS', using K as a partial key.

12. Process according to Claim 2, characterized in
that, on the intension, the retrieval operation
(AI03) of all fathers of a given K=CID(c) is
abstractly implemented, by accessing the
son-to-father relation SF, using K as a partial
key.

13. Process according to Claim 2, characterized in
that, on the intension, the insert, delete, change
operations (AI04) are abstractly implemented.

14. Process according to Claim 13, characterized in
that the insert operations (AI04) are performed by
inserting the new concept C in the dictionaries
(AIS1), in the father to son relation (AIS3) and in
the son to father relation (AIS4).

15. Process according to Claim 13, characterized in
that the deletion operations (AI04) are performed
by deleting from the intension (AIS1, AIS3, AIS4)
all concepts c ∈ C_down(C), the documents in the
deep extension of C being added to the shallow
extension of C', where C' is the father of C in the
taxonomy, unless C is the root of the taxonomy, the
shallow (AES2) and deep (AES1) extensions for all
concepts c ∈ C_down(C) being removed, the concepts
in C_down(C) being removed from the classification
(AES3) of all the documents in the deep extension
of C.

16. Process according to Claim 13, characterized in
that the change operations (AI04) are of three
types:

- changing the labeling of a concept C, said change
only requiring the modification of the textualLabel
in AIS1;

- changing the place of a concept C in the
taxonomy; and

- adding an additional father C' to C in the
taxonomy.

17. Process according to Claim 2, characterized in
that the extension is abstractly represented by a
deep extension (AES1), in the form:

DE: [CID] -> {DID}
storing, for each concept c, all the documents in
its deep extension, that is, all the documents
classified under c or under any descendant c' of c.

18. Process according to Claim 2, characterized in
that the extension is abstractly represented by a
shallow extension (AES2), in the form:

SE: [CID] -> {DID} equivalent to [CID, DID]
storing, for each concept c, all the documents in
its shallow extension, that is, all the documents
directly classified under c.

19. Process according to Claim 2, characterized in
that the extension is abstractly represented by a
classification (AES3), in the form:

CL: [DID] -> {CID}
storing, for each document, the most specific
concepts under which it is classified, all the
ancestors of these concepts being recovered through
the son-to-father (SF) relation in the intension.

20. Process according to Claim 2, characterized in
that the extension is abstractly represented by a
document directory (AES4) mapping document id's
into information required to retrieve the specific
document.

21. Process according to Claim 2, characterized in
that, on the extension, a retrieval operation
(AE01) of the deep extension of a concept c, such
that CID(c) = K, is implemented, said retrieval
operation (AE01) comprising the step of accessing
the deep-extension relation DE, using K as a
partial key.

22. Process according to Claim 2, characterized in
that, on the extension, a retrieval operation
(AE02) of the shallow extension of a concept c,
such that CID(c) = K, is implemented, such
retrieval operation (AE02) comprising the step of
accessing the shallow extension relation SE, using
K as a partial key.
S|02 l 3. Process according to Claim 2, characterized in 
'•' ■.that on the extension, a testing operation (AE03) 
[ ? /? is implemented for testing the membership of a set 
of DIDs {DID} in the deep extension of a concept 
^CID, said testing operation (AE03) comprising the 
-</v';;eteps of: retrieving the deep extension of CID; 

and, for each d in {DID}, testing whether d belongs 
$:A r 'tp t the deep extension; if it does, return TRUE; if 
>f no d in {DID} does, return FALSE . 

24. Process according to Claim 2, characterized in 
that, on the extension, a counting operation (AE04) 
is implemented for counting, given a set of DIDs 
{DID}, the number of documents in {DID} which are 
also in the deep extension of CID, said counting 
operation (AE04) comprising the steps of: 
retrieving the deep extension of CID; initializing 
CNT to 0; for each d in {DID}, testing whether d 
belongs to the deep extension; if it does, 
CNT=CNT+1; returning CNT. 
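The membership test (AE03) and the counting operation (AE04) can be sketched as follows. This is a minimal illustration that assumes the deep extension relation DE is held as a Python dict from concept id to a set of document ids; the function names and sample data are hypothetical.

```python
def in_deep_extension(de, cid, dids):
    """AE03-style test: TRUE as soon as one d in dids is in the deep extension."""
    deep = de.get(cid, set())      # retrieve the deep extension of cid
    for d in dids:
        if d in deep:
            return True
    return False                   # no d in dids belongs to it


def count_in_deep_extension(de, cid, dids):
    """AE04-style count of the documents in dids that are in the deep extension."""
    deep = de.get(cid, set())
    cnt = 0
    for d in dids:
        if d in deep:
            cnt += 1
    return cnt


# Assumed sample data.
de = {"animals": {1, 2, 3}, "dogs": {1}, "plants": {4}}
```

With this data, `in_deep_extension(de, "dogs", {1, 2})` holds, while `count_in_deep_extension(de, "animals", {1, 2, 4})` counts two documents.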
25. Process according to Claim 2, characterized in 
that, on the extension, a testing operation (AE05) 
is implemented for testing the membership of a set 
of DIDs {DID} in the shallow extension of a concept 
CID, said testing operation (AE05) comprising the 
steps of: retrieving the shallow extension of CID; 
and, for each d in {DID}, testing whether d belongs 
to the shallow extension; if it does, return TRUE; 
if no d in {DID} does, return FALSE. 

26. Process according to Claim 2, characterized in 
that, on the extension, a producing operation 
(AE06) is implemented for producing, given a set of 
DIDs {DID}, the projection under a concept CID, said 
producing operation (AE06) comprising the steps of: 
retrieving the set {SON} of all the sons of CID; 
for each concept s in {SON}, using operation AE03, or 
operation AE04 if counters are desired, to test the 
membership of {DID} in s; if the operation returns 
TRUE (>0 if AE04 is used), adding s to list R; 
returning R. 

27. Process according to Claim 2, characterized in 
that, on the extension, a producing operation 
(AE07) is implemented for producing, given a set of 
DIDs {DID}, the reduced taxonomy for {DID}, said 
producing operation (AE07) comprising the steps of: 
producing the projection of {DID} for the root; 
and, on subsequent explosion of concept c, 
producing the projection of {DID} for c. 
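The projection (AE06) and the step-by-step production of the reduced taxonomy (AE07) might look like the sketch below. It assumes a father-to-son mapping `sons` and a deep extension mapping `de`, both plain dicts; these names, the `counters` flag and the sample data are illustrative assumptions.

```python
def projection(sons, de, cid, dids, counters=False):
    """AE06-style projection of dids under concept cid.

    Keeps each son s of cid whose deep extension shares at least one
    document with dids; with counters=True the count is returned as well.
    """
    result = []
    for s in sons.get(cid, ()):
        n = len(de.get(s, set()) & dids)   # AE04-style count
        if n > 0:
            result.append((s, n) if counters else s)
    return result


# Assumed sample taxonomy and deep extensions.
sons = {"root": ["animals", "plants"], "animals": ["dogs", "cats"]}
de = {"animals": {1, 2, 3}, "plants": {4}, "dogs": {1}, "cats": {2}}
dids = {1, 3}

# AE07-style use: project for the root first, then for each concept the
# user explodes.
top_level = projection(sons, de, "root", dids)
after_explosion = projection(sons, de, "animals", dids, counters=True)
```

Only the concepts that the user actually opens are projected, so the cost of each step depends on the number of sons of the exploded concept rather than on the size of the whole taxonomy.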
28. Process according to Claim 2, characterized in 
that the boolean operations on concepts are 
implemented through the corresponding set 
operations on the deep extension of said concepts 
(AE08). 

29. Process according to Claim 2, characterized in 
that, on the extension, an insertion operation 
(AE09) of a new document is implemented, said 
insertion operation (AE09) comprising the steps of: 
for each c ∈ {C}, inserting DID(d) in the shallow 
extension of c (AES2), if c is not a terminal 
concept and the shallow extension must be stored; 
inserting DID(d) in the deep extension (AES1) of 
C_up(c); inserting an item [DID(d)] -> {C} in the 
classification structure AES3. 
30. Process according to Claim 2, characterized in 
that, on the extension, a deletion operation 
(AE010) of an existing document is implemented, 
said deletion operation (AE010) comprising the 
steps of: retrieving the set of concepts {C} under 
which d is shallowly classified, by accessing AES3 
with DID(d) as the key (operation AE02); for each c 
∈ {C}, deleting DID(d) from the shallow extension 
of c; for all c' ∈ C_up(c), deleting DID(d) from the 
deep extension of c'; deleting the entry 
corresponding to DID(d) from AES3. 
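The insertion (AE09) and deletion (AE010) steps can be sketched together. The code below is a simplified illustration with two stated assumptions: C_up(c) is taken to contain c itself plus all of its ancestors, and the shallow extension is always stored (the claim stores it only for non-terminal concepts when required). All function and variable names are hypothetical.

```python
def ancestors_or_self(cid, fathers):
    """Assumed reading of C_up(c): c plus every ancestor reachable via SF."""
    seen, stack = set(), [cid]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(fathers.get(c, ()))
    return seen


def insert_document(did, concepts, fathers, de, se, cl):
    """AE09-style insertion of document did classified under concepts."""
    for c in concepts:
        se.setdefault(c, set()).add(did)          # shallow extension (AES2)
        for a in ancestors_or_self(c, fathers):
            de.setdefault(a, set()).add(did)      # deep extension (AES1)
    cl[did] = set(concepts)                       # classification (AES3)


def delete_document(did, fathers, de, se, cl):
    """AE010-style deletion, driven by the classification entry of did."""
    for c in cl.get(did, set()):
        se.get(c, set()).discard(did)
        for a in ancestors_or_self(c, fathers):
            de.get(a, set()).discard(did)
    cl.pop(did, None)


# Assumed sample taxonomy: root -> animals -> {dogs, cats}
fathers = {"animals": {"root"}, "dogs": {"animals"}, "cats": {"animals"}}
de, se, cl = {}, {}, {}
insert_document(1, {"dogs"}, fathers, de, se, cl)
insert_document(2, {"cats"}, fathers, de, se, cl)
```

Deleting document 1 afterwards with `delete_document(1, fathers, de, se, cl)` removes it from `dogs`, `animals` and `root` and drops its entry from `cl`.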

31. Process according to Claim 2, characterized in 
that, on the extension, a document reclassification 
operation (AE011) is implemented, said 
reclassification operation (AE011) comprising the 
steps of: let d be initially classified under a 
concept c (possibly null), c' being the new concept 
under which d must be classified (possibly null); 
if both c and c' are non-null, classifying d under 
c'; if c is null, additionally classifying d under 
c'; if c' is null, removing the original 
classification under c; if c is not null, 
eliminating DID(d) from the shallow extension 
(AES2) of c; eliminating DID(d) from the deep 
extension (AES1) of all c'' ∈ C_up(c); eliminating c 
from the classification of d (AES3); if c' is not 
null, inserting DID(d) in the shallow extension 
(AES2) of c' (if the shallow extension of c' 
exists); inserting DID(d) in the deep extension 
(AES1) of all c'' ∈ C_up(c'); inserting c' in the 
classification of d (AES3). 

32. Process according to Claim 2, characterized in 
that, on the extension, a finding operation (AE012) 
is implemented for finding the concepts under which 
a document d is immediately classified, said 
finding operation (AE012) comprising the step of 
retrieving {C} from AES3, using DID(d) as a key. 

33. Process according to Claim 2, characterized in 
that the deep extension (AES1), the shallow 
extension (AES2) and the classification (AES3) are 
represented through bit vectors, said bit vectors 
being compressed or not. 

34. Process according to Claim 33, characterized in 
that the counting of documents in the result of 
logic operations on bit vectors is performed 
through a constant table V whose size is 2^n, whose 
element V[i] contains the number of bits at 1 in 
binary number i, and processing the bit vector n 
bits at a time, adding to the counter for every 
group j of n bits, whose binary value is v', the 
amount V[v']. 
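The table-driven bit count of Claim 34 can be sketched for n = 8, so that each byte of the bit vector is one group. This is a minimal illustration; the choice n = 8, the `bytes` representation of bit vectors and the function names are assumptions.

```python
N_BITS = 8  # assumed group size n; the table then has 2**8 = 256 entries

# Constant table V: V[i] is the number of bits set to 1 in the binary number i.
V = [bin(i).count("1") for i in range(2 ** N_BITS)]


def count_bits(vector: bytes) -> int:
    """Count the 1 bits of a bit vector, n = 8 bits (one byte) at a time."""
    counter = 0
    for group_value in vector:     # binary value v' of each 8-bit group
        counter += V[group_value]
    return counter


def and_vectors(a: bytes, b: bytes) -> bytes:
    """Bitwise AND of two equal-length bit vectors (set intersection)."""
    return bytes(x & y for x, y in zip(a, b))


# Example: documents common to two deep extensions held as bit vectors.
common = and_vectors(bytes([0b11110000]), bytes([0b10101010]))
```

A larger n reduces the number of table lookups per vector at the price of a table that grows as 2^n.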

35. Process according to Claim 2, characterized by 
the representation of deep extension (AES1) through 
Bloom filters. 

36. Process according to Claim 2, characterized in 
that the deep extension (AES1), the shallow 
extension (AES2) and the classification (AES3) are 
represented through inverted lists, said inverted 
lists being compressed or not. 

37. Process according to Claim 2, characterized by 
the use of buffering strategies to manage data. 

38. Process according to any one of the previous 
Claims, characterized in that a simple virtual 
concept is completely described by four abstract 
operations: 

V1: given a virtual concept v, retrieve all its 
sons; 

V2: given a virtual concept v, retrieve its deep 
extension; 

V3: given the son s of a virtual concept v, 
retrieve its deep extension; and 

V4: given a document d, find all the terminal 
concepts (descendants of v) under which it is 
stored. 

39. Process according to Claim 38, characterized in 
that, in order to implement said abstract 
operations (V1, V2, V3, V4), two abstract relations 
are kept for each virtual concept v: 

S_v: [value] -> {DID} 
which stores the set of documents with a given 
value in the domain of values of the virtual 
concept; and 

C_v: [DID] -> {value} 
which stores the set of values for a specific 
document, where S_v represents the inversion of C_v 
and therefore need not be explicitly stored. 
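One possible reading of how V1 to V4 can be answered from C_v alone, with S_v derived by inversion, is sketched below. The claims do not spell out these implementations, so the function names, the dict-of-sets representation and the sample values are assumptions made for illustration.

```python
def invert(c_v):
    """Derive S_v: value -> {DID} from C_v: DID -> {value}."""
    s_v = {}
    for did, values in c_v.items():
        for value in values:
            s_v.setdefault(value, set()).add(did)
    return s_v


def v1_sons(c_v):
    """V1: the sons of v are taken to be the distinct values present."""
    return sorted(invert(c_v))


def v2_deep_extension(c_v):
    """V2: every document that has at least one value."""
    return {did for did, values in c_v.items() if values}


def v3_son_extension(c_v, s):
    """V3: documents carrying the value s."""
    return invert(c_v).get(s, set())


def v4_concepts_of(c_v, did):
    """V4: the values (terminal concepts under v) of document did."""
    return set(c_v.get(did, set()))


# Assumed sample: a virtual concept "colour".
c_v = {1: {"red"}, 2: {"red", "blue"}, 3: {"green"}}
```

Here `invert` is recomputed on each call for brevity; a real system would either cache S_v or rely on an index over C_v.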
40. Process according to Claim 39, characterized by 
implementation of operations V1, V2, V3, V4. 

41. Process according to Claim 39, characterized in 
that a derived virtual concept is described by the 
same structures as the simple virtual concepts, 
with additional restrictions. 

42. Process according to Claim 2, characterized in 
that the intension further comprises the abstract 
relation (AIS5): 

[CID] -> [conceptType] 

where conceptType designates simple virtual 
concepts and derived virtual concepts. 

43. Process according to Claim 2, characterized in 
that the intension further comprises the abstract 
relation (AIS6): 

[CID] -> [S_CID] 

which, for simple virtual concepts, stores the 
abstract relation S_v for the virtual concept CID. 

44. Process according to Claim 2, characterized in 
that the intension further comprises the abstract 
relation (AIS7): 

[CID] -> [C_CID] 

which, for simple virtual concepts, stores the 
abstract relation C_v for the virtual concept CID. 

45. Process according to Claim 2, characterized in 
that the intension further comprises the abstract 
relation (AIS8): 

[CID] -> [CID', restriction] 

which, for derived virtual concepts only, 
identifies the virtual concept to refer to and the 
additional restriction. 

46. Process according to any one of the previous 
Claims, characterized in that the time-varying 
concepts, whose value is represented by abstract 
timestamps, can be represented by virtual concepts, 
representing with T the timestamp value of the 
current time. 

47. Process according to Claim 46, characterized in 
that the sons of a time-varying concept V, 
represented as a virtual concept, are retrieved 
through the abstract query: 

SELECT DISTINCT T-value 
FROM C_v 

where T is the current time. 

48. Process according to Claim 46, characterized in 
that the deep extension of a time-varying concept 
t, represented as a virtual concept, is retrieved 
through the abstract query: 

SELECT DISTINCT DID 
FROM C_v. 

49. Process according to Claim 46, characterized in 
that the extension of the son s of a time-varying 
concept t, represented as a virtual concept, is 
retrieved through the abstract query: 

SELECT DISTINCT DID 
FROM C_v 
WHERE value=T-s 

where T is the current time. 
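The three abstract queries of Claims 47 to 49 can be tried against a small relational table. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table name `c_v`, its columns, the integer timestamps and the helper names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# C_v held as a table of (document id, timestamp value) pairs.
conn.execute("CREATE TABLE c_v (did INTEGER, value INTEGER)")
conn.executemany("INSERT INTO c_v VALUES (?, ?)", [(1, 90), (2, 95), (3, 95)])

T = 100  # assumed timestamp of the current time


def sons(conn, now):
    """Claim 47: the sons are the distinct ages T - value."""
    rows = conn.execute("SELECT DISTINCT ? - value FROM c_v", (now,))
    return sorted(r[0] for r in rows)


def deep_extension(conn):
    """Claim 48: every document that has a timestamp."""
    rows = conn.execute("SELECT DISTINCT did FROM c_v")
    return sorted(r[0] for r in rows)


def son_extension(conn, now, s):
    """Claim 49: documents whose timestamp equals T - s."""
    rows = conn.execute(
        "SELECT DISTINCT did FROM c_v WHERE value = ? - ?", (now, s)
    )
    return sorted(r[0] for r in rows)
```

Because the sons are computed from T at query time, no stored classification has to change as time passes.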

50. Process according to any one of the previous 
Claims, characterized in that the time-varying 
concepts can be split into N intervals (from more 
recent to older), which are stored as real 
concepts. 

51. Process according to Claim 50, characterized in 
that, for each interval I, we keep: 

a. the list L(I) of DIDs in the interval ordered 
by decreasing timestamps; 
b. an interval representative IR(I): the last DID 
in the interval together with its timestamp; 
c. a classification criterion for the interval. 

52. Process according to Claim 50 or 51, 
characterized in that the classification of 
documents is periodically re-computed, according to 
the following algorithm: 

For each interval I 
  while IR(I) needs reclassification do 
    Reclassify(IR(I)); 
    set IR(I) = the last DID in the ordered list L(I) 

where Reclassify(IR(I)) is 

  Delete IR(I).DID from I 
  For (i = I+1 to N) 
    if IR(I) meets the classification criterion 
    for interval i { 
      insert IR(I) in interval i 
      break; 
    } 
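The periodic re-computation of Claim 52 can be sketched as below. The sketch assumes that each interval's classification criterion is a maximum age (`max_age`, with `None` meaning unbounded), that "needs reclassification" means the representative no longer meets its own interval's criterion, and that lists are re-sorted after each insertion to keep the decreasing-timestamp order. These choices and all names are illustrative assumptions.

```python
def fits(interval, doc, now):
    """Assumed classification criterion: age of doc within max_age."""
    max_age = interval["max_age"]
    return max_age is None or now - doc[1] <= max_age


def recompute(intervals, now):
    """Walk intervals from most recent to oldest, moving aged-out documents.

    Each interval is {"max_age": ..., "docs": [(did, timestamp), ...]} with
    docs ordered by decreasing timestamp, so docs[-1] plays the role of IR(I).
    """
    for idx, interval in enumerate(intervals):
        docs = interval["docs"]
        while docs and not fits(interval, docs[-1], now):
            doc = docs.pop()                      # delete IR(I) from I
            for later in intervals[idx + 1:]:     # for i = I+1 to N
                if fits(later, doc, now):
                    later["docs"].append(doc)
                    later["docs"].sort(key=lambda d: -d[1])
                    break


# Assumed sample: three intervals, from more recent to older.
intervals = [
    {"max_age": 10, "docs": [(3, 95), (2, 92), (1, 80)]},
    {"max_age": 30, "docs": [(0, 75)]},
    {"max_age": None, "docs": []},
]
recompute(intervals, now=110)
```

Only the oldest document of each interval is inspected at each step, so intervals whose representative is still valid are skipped after a single test. As in the claimed algorithm, a document that meets the criterion of no later interval is simply removed.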

53. Process according to any one of the previous 
Claims, characterized in that a dynamic taxonomy is 
used to represent data stored as a single relation 
(or object) or, more generally, represented by a 
single view on the database. 

54. Process according to Claim 53, characterized in 
that the documents correspond to tuples (or rows, 
records, objects) in the view V, and, in order to 
identify a document, we can either use the primary 
key of the view as document identifier (DID) or 
keep two abstract relations mapping system- 
generated DID's to and from the primary key PK of 
the view: 

DK: [DID] -> [PK] 
IDK: [PK] -> [DID] 

where PK represents the primary key of the 
relation, using DK to access a tuple of V, given a 
document id DID, and using IDK to retrieve the 
document id corresponding to a specific value in 
the primary key of V. 

55. Process according to Claim 54, characterized in 
that, given a view V, we can construct a taxonomy T 
for V through the following steps: 

- for each attribute A in V, placing a 
corresponding concept C(A) (either a real or a 
virtual one) as an immediate son of the root, 
virtual concepts using V itself for the synthesis 
of sons and extensions; 

- given a tuple t in V, for each attribute A in V, 
t.A denoting the value of attribute A in t, for 
each real concept C in T (either C(A) or a 
descendant of C(A)), providing a boolean clause 
B(C, t) such that t (represented by DID(t)) is 
classified under C if B(C, t)=TRUE, said boolean 
clause B(C, t) referencing any attribute of t. 

56. Process according to Claim 55, characterized in 
that, when the boolean clause B(C, t) is true when 
t.A ∈ S_C, where S_C is a set of values of attribute 
A and S_C ∩ S_C' = ∅ for every C ≠ C', a table 
T: [v] -> [c] is kept, listing for each value v in 
domain(A) the corresponding concept c; if 
S_C ∩ S_C' ≠ ∅ for some C ≠ C', multiple concepts 
are associated with the same value, so that 
T: [v] -> {c}. 
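The value-to-concept table of Claim 56 can be illustrated with a small sketch. It always uses the general form T: [v] -> {c}, so the disjoint case simply yields one-element sets; the concept names, value sets and function names are hypothetical.

```python
def build_value_table(concept_values):
    """Build T: value -> {concept} from each concept's value set S_C."""
    table = {}
    for concept, values in concept_values.items():
        for v in values:
            table.setdefault(v, set()).add(concept)
    return table


def classify_value(table, value):
    """Concepts under which a tuple with t.A == value is classified."""
    return table.get(value, set())


# Assumed value sets for an attribute "price band"; "mid" overlaps the others.
concept_values = {
    "cheap": {1, 2, 3},
    "mid": {3, 4},
    "expensive": {4, 5},
}
table = build_value_table(concept_values)
```

Looking up a value replaces the evaluation of one boolean clause B(C, t) per concept with a single table access.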

57. Process according to any one of the previous 
Claims, characterized in that it further comprises 
the steps of: 

- creating new taxonomic generalizations, said 
creating step, if the sons of a new taxonomic 
generalization G are real concepts {S}, not 
requiring any boolean clause for G, the 
classification under G being automatically 
performed by the inserting operation (AE09) of a 
new document; and 

- creating extended concepts. 

58. Process according to Claim 57, characterized in 
that the new concepts are derived either as real or 
virtual concepts by operations on the database, 
binding among said virtual concepts being realized 
by operations on the database, binding among said 
real concepts requiring a classification for any 
new tuple, a deletion if t is deleted or a 
reclassification if t is changed, the system, in 
order to classify t, locating the set C of concepts 
for which B(c, t), c∈C is satisfied and 
classifying t under ∀c∈C. 
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61. Process according to Claim 60, characterized in 
that, to realize said monitoring step, dynamic 
concepts are used for which additional 
expressions, such as information retrieval queries, 
are composed by AND with taxonomic expressions, 
and are solved, if required, after the 
corresponding taxonomic expression is satisfied. 

62. Process according to Claim 61, characterized in 
that it can be checked whether a document d 
satisfies a user specification S by applying the 
query specified in S to the set {d} and checking 
whether d is retrieved. 

63. Process according to Claim 62, wherein, if 
specifications only comprise conjunction operations 
and document classification is only under terminal 
concepts, two abstract storage structures are used: 

- a directory of specifications, in the form: 
SD: [SID] -> [N, SPEC] 
where SID is an abstract identifier which uniquely 
identifies the specification, SPEC is the 
specification itself (optional), N is the number of 
concepts referenced in the specification; 

- a specification "inversion", in the form: 
SI: [CID] -> {SID} 
listing for each concept c (represented by its 
concept identifier) all the specifications 
(represented by their specification id) using that 
concept. 

64. Process according to Claim 61, wherein, when a 
specification is created, its abstract identifier 
is created, its directory entry is created in SD 
and the set of concepts referenced in the 
specification are stored in the inversion SI. 

65. Process according to Claim 61, wherein, 
when a document d is changed, C being the set of 
concepts under which d is classified, the set of 
specifications that apply to d are then found in 
the following way: K being the set of concepts used 
to classify document d, for each concept k in K, 
SID(k) is the list of specifications for k 
(accessible through relation SI) ordered by 
increasing specification id's; defining 
MergeCount(K) as the set composed of pairs (SID, N) 
such that SID is in MergeCount(K) if SID belongs to 
a SID(k), k in K; if the pair (SID, N) is in 
MergeCount(K), N counts the number of SID(k) 
referencing SID; if S is an initially empty set, 
which represents the set of specifications 
satisfied by d, for each pair (SID, N), retrieving 
SID.N from SD; if SID.N=N: S=S union SID. 
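The MergeCount step can be sketched as follows for the case of Claim 63 (specifications made only of conjunctions, documents classified only under terminal concepts). For brevity the sketch counts occurrences with `collections.Counter` instead of merging the id-ordered lists; the function name and sample specifications are hypothetical.

```python
from collections import Counter


def satisfied_specifications(doc_concepts, si, sd):
    """Return the ids of the specifications satisfied by one document.

    si: CID -> list of SIDs using that concept (relation SI)
    sd: SID -> (N, SPEC), N being the number of concepts in the spec
    """
    merge_count = Counter()
    for k in doc_concepts:               # K: concepts classifying d
        for sid in si.get(k, ()):        # SID(k)
            merge_count[sid] += 1        # N: how many SID(k) reference sid
    # A conjunction is satisfied when every one of its concepts was seen.
    return {sid for sid, n in merge_count.items() if sd[sid][0] == n}


# Assumed sample specifications.
sd = {
    1: (2, "dogs AND brown"),
    2: (1, "dogs"),
    3: (2, "cats AND brown"),
}
si = {"dogs": [1, 2], "brown": [1, 3], "cats": [3]}
```

A document classified under `dogs` and `brown` satisfies specifications 1 and 2; specification 3 is seen only once out of its two concepts and is rejected.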
66. Process according to Claim 61, wherein, when 
there are specifications using unrestricted set 
operations, S (represented by SID(S)) is a 
specification, and the following steps are 
provided: 

- transforming S in disjunctive normal form (i.e. 
as a disjunction of conjunctions), each conjunctive 
clause in S being called a component of S, SIDi(S) 
denoting the i-th component of S; 

- storing the directory of specifications as two 
abstract relations: 
SD, omitting N 
SCD: [COMPONENT] -> [SID, N], where COMPONENT 
stores components of specifications, COMPONENT.SID 
represents the specification id of the 
specification S of which COMPONENT is a component, 
and COMPONENT.N is the number of concepts 
referenced in the component; 

- storing the specification inversion as: 
SI: [CID] -> {COMPONENT}, where CID is a concept 
identifier and CID.COMPONENT is the set of 
components referencing the concept identified by 
CID; 

- with K being the set of concepts used to classify 
document d, for each concept k in K, COMPONENT(k) 
is the list of components for k (accessible through 
relation SI) ordered by increasing component id's; 
defining ComponentMergeCount(K) as the set 
composed of pairs (COMPONENT, N) such that 
COMPONENT is in ComponentMergeCount(K) if COMPONENT 
belongs to a COMPONENT(k), k in K; if the pair 
(COMPONENT, N) is in ComponentMergeCount(K), N 
counting the number of COMPONENT(k) referencing 
COMPONENT; 

- with S being a set initially empty, for each 
pair (COMPONENT, N), retrieving COMPONENT.N through 
relation SCD; 

if COMPONENT.N=N: S=S union COMPONENT.SID, S 
representing the set of specifications satisfied by 
d. 
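The component-based variant can be sketched in the same spirit. Each specification is assumed to be already in disjunctive normal form, and its conjunctive components carry hypothetical string ids such as "10.1"; a `Counter` again stands in for the ordered-list merge.

```python
from collections import Counter


def satisfied_specifications_dnf(doc_concepts, si, scd):
    """Return ids of specifications with at least one satisfied component.

    si:  CID -> list of component ids referencing that concept
    scd: component id -> (SID, N), N being the number of concepts
         referenced in the component
    """
    component_merge_count = Counter()
    for k in doc_concepts:
        for component in si.get(k, ()):
            component_merge_count[component] += 1
    satisfied = set()
    for component, n in component_merge_count.items():
        sid, needed = scd[component]
        if needed == n:              # every concept of the conjunction seen
            satisfied.add(sid)
    return satisfied


# Assumed sample: spec 10 = (dogs AND brown) OR cats; spec 20 = cats AND brown.
scd = {"10.1": (10, 2), "10.2": (10, 1), "20.1": (20, 2)}
si = {
    "dogs": ["10.1"],
    "brown": ["10.1", "20.1"],
    "cats": ["10.2", "20.1"],
}
```

Collecting specification ids in a set means a specification with several satisfied components is reported once.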



67. Process according to Claim 61, wherein the 
modification of the specification inversion SI 
comprises the steps of: 

- if a specification or component Z references 
concept C, represented by CID(C), then: 

- if C is a terminal concept: 
  CID(C).SID = CID(C).SID union Z, if Z is a 
  specification 
  CID(C).COMPONENT = CID(C).COMPONENT union Z, if Z 
  is a component 

- if C is a non-terminal concept: 
  for each k in C_down(C) 
  CID(k).SID = CID(k).SID union Z, if Z is a 
  specification 
  CID(k).COMPONENT = CID(k).COMPONENT union Z, if Z 
  is a component. 

68. Process according to Claim 61, characterized in 
that it further comprises computing the 
specifications satisfied by a set of documents D 
(whose cardinality is greater than 1) according to 
the strategy of applying the previous techniques 
without modifications to every document d in D, 
then removing possible duplicate specifications. 

69. Process according to Claim 61, characterized in 
that it further comprises computing the 
specifications satisfied by a set of documents D 
(whose cardinality is greater than 1) according to 
a strategy of: defining as K the set of concepts 
used to classify D; applying the adequate technique 
among the described ones; and determining the set S 
of "candidate" specifications, every specification 
s in S being then checked by performing it on D. 
70. Process according to Claim 27, characterized in 
that the reduced taxonomy is totally computed in a 
single step, wherein RT is the set of concepts in 
the reduced taxonomy, RT being computed by testing, 
for each concept c in T, the membership of {DID} 
in c through operation AE03 or AE04 (if counters 
are required), concept c being in RT if and only if 
operation AE03 returns TRUE or operation AE04 
returns a counter larger than 0. 

71. Process according to Claim 70, characterized in 
that the computation is sped up through the 
following steps: 

- initializing a table S of size |T|, where S[i] 
holds information on the current status of concept 
i, initialized at "pending"; 

- starting from the uppermost levels, and 
continuing down in the tree, processing concept i; 

- if S[i] is "empty", determining that i does not 
belong to RT, and continuing the processing with 
the next concept; 

- if S[i] is not "empty", applying operation AE03 
or AE04 to i; 

- if the operation returns TRUE (AE03) or a counter 
larger than 0 (AE04), determining that i belongs to 
RT; otherwise, determining that neither i nor any 
of its descendants belong to RT and setting to 
"empty" all S[j] in S, such that j is a descendant 
of i in the taxonomy. 
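The pruned single-step computation of the reduced taxonomy can be sketched as below. The sketch assumes the concepts are supplied in top-down order (`order`), that a table of descendants per concept is available, and that the deep extension is a dict of sets; all names and the sample taxonomy are illustrative.

```python
def reduced_taxonomy(order, descendants, de, dids):
    """Compute RT, skipping the subtree of every concept found empty.

    order:       all concepts of T, uppermost levels first
    descendants: concept -> iterable of all its descendants
    de:          deep extension, concept -> set of document ids
    dids:        the selected set of documents {DID}
    """
    status = {c: "pending" for c in order}     # table S
    rt = []
    for c in order:
        if status[c] == "empty":
            continue                           # pruned: not in RT
        if de.get(c, set()) & dids:            # AE03-style membership test
            rt.append(c)
        else:
            for j in descendants.get(c, ()):   # mark the whole subtree
                status[j] = "empty"
    return rt


# Assumed sample taxonomy.
order = ["root", "animals", "plants", "dogs", "cats", "trees"]
descendants = {
    "root": ["animals", "plants", "dogs", "cats", "trees"],
    "animals": ["dogs", "cats"],
    "plants": ["trees"],
}
de = {
    "root": {1, 2, 3, 4}, "animals": {1, 2, 3}, "plants": {4},
    "dogs": {1}, "cats": {2}, "trees": {4},
}
rt = reduced_taxonomy(order, descendants, de, {1, 3})
```

In this run `plants` fails the test, so `trees` is marked "empty" and never tested, which is where the saving over testing every concept comes from.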

72. Process according to Claim 69, characterized in 
that the descendants are obtained by keeping a 
precompiled table D, holding for each concept in 



from it in the taxonomy, such a table being 
recomputed every time the taxonomy changes. 
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